Just attached a draft patch to #562 if somebody is interested in having a look. It outlines how leave coordination could work, although I would be surprised if the code actually works, as it has not been tested. I'll improve & test it tomorrow.
Any ideas/comments would be welcome, especially on whether people think this approach is the best one, or even needed?

Cheers,
-Jaakko

On Tue, Dec 1, 2009 at 1:01 PM, Jaakko <rosvopaalli...@gmail.com> wrote:
> I think adding nodes to the ring (that is, bootstrapping) is safe even
> if there are multiple such operations happening at the same time.
> Bootstrapping causes existing ranges to be fragmented, and if we count
> pending ranges based on old token metadata, the worst that can happen
> (I think) is that pending ranges will be too big and we get data that
> is not our responsibility once the other bootstraps finish. I could
> not find any scenario where pending ranges did not cover a node's
> eventual ranges.
>
> However, nodes booting or leaving when another node is leaving within
> the same affected ranges is another issue entirely (as discussed in
> #562 and an earlier mail thread; see especially "2nd option" in the
> #562 description). The problem is that in this case the actual ranges
> might eventually be _bigger_ than the pending ranges calculated, or be
> entirely wrong as one node disappears from the ring. This problem is
> present regardless of replication strategy:
>
> Original cluster (primary range, replica range):
> A: E-A, D-E
> B: A-B, E-A
> D: B-D, A-B
> E: D-E, B-D
>
> C bootstraps in between B and D:
> A: E-A, D-E
> B: A-B, E-A
> C: B-C, A-B
> D: C-D, B-C
> E: D-E, C-D
>
> At the same time B leaves:
> A: E-A, D-E
> D: A-D, E-A
> E: D-E, A-D
>
> Actual final situation:
> A: E-A, D-E
> C: A-C, E-A
> D: C-D, A-C
> E: D-E, C-D
>
> It is easy to see that when C bootstraps, it thinks (based on the
> token metadata at that time) that its primary range will be B-C and
> its replica range A-B. The final situation is A-C / E-A, which is much
> larger than the node assumed.
>
> I think this will need another control channel, as gossiping is not
> able to solve this.
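The range mismatch in the quoted example can be reproduced with a small sketch. This is purely illustrative Python, not code from the patch or from Cassandra; the ring model (each node's primary range runs from its predecessor to itself, and with RF=2 its replica range is the predecessor's primary range) matches the tables above.

```python
# Illustrative model of the ring in the example above (not Cassandra code).
# Each node's primary range is (predecessor, node]; with RF=2 its replica
# range is its predecessor's primary range, as in the tables above.

def ranges(nodes):
    """Map each node to its (primary range, replica range) on the ring."""
    ring = sorted(nodes)
    out = {}
    for i, n in enumerate(ring):
        primary = (ring[i - 1], n)
        replica = (ring[i - 2], ring[i - 1])
        out[n] = (primary, replica)
    return out

# What C assumes while bootstrapping (B is still in the ring):
predicted = ranges(["A", "B", "C", "D", "E"])["C"]
print(predicted)  # (('B', 'C'), ('A', 'B'))

# The actual final ring once B's concurrent leave completes:
actual = ranges(["A", "C", "D", "E"])["C"]
print(actual)     # (('A', 'C'), ('E', 'A')) -- larger than C assumed
```

The sketch shows exactly the failure mode described: C's pending ranges were computed against token metadata that still contained B, so the ranges it actually ends up responsible for are strictly larger.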
> Basically we need to control movement in a way that if a node is
> leaving, there must not be any other movement (leaving or
> bootstrapping) within the affected ranges. This means that:
>
> (1) before a node can leave, it must contact all nodes that are going
> to experience range changes and get permission from them. If any of
> the other nodes already has pending range changes, either due to
> another bootstrap or leave operation, the node wanting to leave must
> wait.
>
> (2) before a node can bootstrap, it must get permission from the
> intended bootstrap source. If there is a leave operation in progress,
> bootstrap must wait.
>
> As also discussed in #562, for automatic load balancing purposes it
> would be best to limit the number of moving nodes (within affected
> ranges again) to one, so basically I think we'll need a movement
> control channel that makes sure only one move operation is in
> progress in a given area of the ring.
>
> This would of course make autobootstrapping a big cluster a bit
> slower, as nodes have to queue when bootstrapping, but I don't think
> this is a big problem. It would make all these dark corner cases go
> away, and would IMHO be the best thing to do. If this proves to be a
> problem (somebody wants to move groups of nodes at the same time or
> something like that), we can then polish the mechanism.
>
> -Jaakko
>
>
> On Tue, Dec 1, 2009 at 4:19 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> Do we have a problem from bootstrapping nodes not being aware of each
>> other in the rack-aware replication strategy?
>>
>> Background: bootstrap makes the assumption that we can simplify things
>> by treating bootstraps of multiple nodes independently, trading some
>> (potential) extra copying for a simpler recovery process if a node
>> fails or is killed during the bootstrap.
>>
>> A couple of examples should illustrate this.
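The movement-control rule proposed in the quoted mail (at most one move in progress per affected region of the ring) could be sketched roughly as below. The class and method names are hypothetical, and a real implementation would run this as a distributed permission exchange between nodes rather than a single in-process object; ranges here are plain intervals, so wraparound at the top of the ring is deliberately ignored.

```python
# Hypothetical sketch of the proposed movement-control rule: a move
# (bootstrap or leave) is granted only if no in-progress move touches
# any of the same ranges. Not from the #562 patch; names are made up.

class MovementCoordinator:
    def __init__(self):
        self.in_progress = []  # list of (kind, affected_ranges)

    def _overlaps(self, r1, r2):
        # Ranges are (start, end] intervals on an ordered keyspace; a
        # real ring implementation must also handle wraparound.
        return r1[0] < r2[1] and r2[0] < r1[1]

    def request_move(self, kind, affected):
        """Grant the move only if no in-progress move overlaps it."""
        for _, ranges in self.in_progress:
            for r in ranges:
                for a in affected:
                    if self._overlaps(a, r):
                        return False  # caller must wait and retry
        self.in_progress.append((kind, affected))
        return True

    def complete_move(self, kind, affected):
        """Release the region once streaming has finished."""
        self.in_progress.remove((kind, affected))

coord = MovementCoordinator()
print(coord.request_move("leave", [(10, 30)]))      # True: granted
print(coord.request_move("bootstrap", [(20, 40)]))  # False: overlaps the leave
coord.complete_move("leave", [(10, 30)])
print(coord.request_move("bootstrap", [(20, 40)]))  # True: granted now
```

This mirrors both rules from the mail: a leave blocks any other movement in its affected ranges, and a bootstrap queues behind an in-progress leave, which is where the "autobootstrapping a big cluster gets a bit slower" cost comes from.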
>>
>> Suppose we have nodes A and D in rack-unaware mode, with a
>> replication factor of one (for simplicity). The ranges are then
>> (D-A] for A and (A-D] for D.
>>
>> Nodes B and C then bootstrap between A and D. So we copy (A-B] to B
>> and (A-C] to C. If both bootstraps complete successfully, then they
>> will serve (A-B] and (B-C]; that is, we transferred (A-B] to C
>> unnecessarily. But if either bootstrap fails, the remaining
>> bootstrap can ignore that and serve the entire range that was
>> transferred to it.
>>
>> So for rack-unaware bootstrapping it is clear that
>> bootstrap-in-isolation is fine. But what about rack-aware?
>>
>> Recall that in rack-aware mode, we write the first replica to the
>> first node on the ring _in the other data center_, and the remaining
>> replicas to nodes in the same one.
>>
>> Say we have two nodes A and D, in different DCs, with a replication
>> factor of 2:
>>
>> A / D
>>
>> Node  Primary range  Replica for
>> A     (D-A]          (A-D]
>> D     (A-D]          (D-A]
>>
>> If we add nodes B and C in the same DCs as A and D, respectively, we
>> bootstrap as
>>
>> A,B / C,D
>>
>> B predicts the ring will be
>> Node  Primary range  Replica for
>> A     (D-A]          (B-D]
>> B     (A-B]
>> D     (B-D]          (D-A], (A-B]
>>
>> C predicts
>> Node  Primary range  Replica for
>> A     (D-A]          (A-C], (C-D]
>> C     (A-C]          (D-A]
>> D     (C-D]
>>
>> And really we end up with
>> Node  Primary range  Replica for
>> A     (D-A]          (B-C], (C-D]
>> B     (A-B]
>> C     (B-C]          (D-A], (A-B]
>> D     (C-D]
>>
>> So each node does have (a superset of) the right data copied. (Note
>> that C has (A-B] as a replica in the final version, whereas it
>> predicted it would be part of its primary range, but that doesn't
>> matter as long as it ended up w/ the right data on it.)
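The rack-aware placement rule quoted above (first replica to the first node clockwise in the other data center, remaining replicas to nodes in the same one) can be sketched as follows. This is illustrative Python, not the actual RackAwareStrategy code; node names and DC labels are taken from the example.

```python
# Illustrative sketch of the rack-aware rule described above (not the
# actual Cassandra strategy code): the first replica of each primary
# range goes to the first node clockwise in a *different* data center,
# any remaining replicas to the next nodes clockwise in the *same* one.

def rack_aware_endpoints(ring, dc, rf):
    """ring: nodes in token order; dc: node -> data center name;
    rf: replication factor. Returns node -> list of nodes that store
    that node's primary range (the node itself first)."""
    n = len(ring)
    result = {}
    for i, node in enumerate(ring):
        endpoints = [node]
        # First replica: next node clockwise in a different data center.
        for j in range(1, n):
            cand = ring[(i + j) % n]
            if dc[cand] != dc[node]:
                endpoints.append(cand)
                break
        # Remaining replicas: next nodes clockwise in the same DC.
        for j in range(1, n):
            if len(endpoints) >= rf:
                break
            cand = ring[(i + j) % n]
            if dc[cand] == dc[node] and cand not in endpoints:
                endpoints.append(cand)
        result[node] = endpoints[:rf]
    return result

# Final ring of the A,B / C,D example above, RF=2:
dc = {"A": "dc1", "B": "dc1", "C": "dc2", "D": "dc2"}
print(rack_aware_endpoints(["A", "B", "C", "D"], dc, 2))
# {'A': ['A', 'C'], 'B': ['B', 'C'], 'C': ['C', 'A'], 'D': ['D', 'A']}
```

Reading the output the other way around reproduces the "really we end up with" table: C appears as an endpoint for A's and B's primary ranges, i.e. it holds replicas of (D-A] and (A-B], while B and D hold only their primary ranges.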
>>
>> If instead we add B and C both to D's datacenter, we have:
>>
>> A / B,C,D
>>
>> B predicts
>> Node  Primary range  Replica for
>> A     (D-A]          (A-B], (B-D]
>> B     (A-B]          (D-A]
>> D     (B-D]
>>
>> C predicts
>> Node  Primary range  Replica for
>> A     (D-A]          (A-C], (C-D]
>> C     (A-C]          (D-A]
>> D     (C-D]
>>
>> And really we end up with
>> Node  Primary range  Replica for
>> A     (D-A]          (A-B], (B-C], (C-D]
>> B     (A-B]          (D-A]
>> C     (B-C]
>> D     (C-D]
>>
>> Again each node ends up with the right data.
>>
>> Are there conditions under which we don't?
>>
>> After playing around with this in my mind I think that there are not,
>> but this is tricky so peer review is welcome. :)
>>
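The "superset of the right data" claim for C in this second scenario can be checked mechanically by modeling ranges as token sets. The token positions below are made up for illustration; only their order (A, B, C, D clockwise) matters.

```python
# Illustrative check (not Cassandra code): model ring positions as
# integers on a 16-token ring and ranges as token sets, then verify that
# what C copied under its isolated prediction covers what the final ring
# actually requires of it.

SPACE = 16  # hypothetical total number of tokens on the ring

def wrapping_range(start, end):
    """Tokens in the range (start, end], wrapping around the ring."""
    if start < end:
        return set(range(start + 1, end + 1))
    return set(range(start + 1, SPACE + 1)) | set(range(1, end + 1))

A, B, C, D = 4, 8, 12, 16  # hypothetical positions in token order

# C's isolated prediction (ring A, C, D): primary (A-C], replica for (D-A].
predicted_C = wrapping_range(A, C) | wrapping_range(D, A)
# Final ring (A, B, C, D): C serves only its primary range (B-C].
actual_C = wrapping_range(B, C)

print(actual_C <= predicted_C)  # True: C copied a superset of what it serves
```

The same set-containment check could be repeated for B, or for the first scenario, which is essentially what the quoted mail does by hand with the three tables.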