Is it a problem that bootstrapping nodes are not aware of each other under the rack-aware replication strategy?
Background: bootstrap assumes that we can simplify things by treating the bootstrap of multiple nodes independently, trading some (potential) extra copying for a simpler recovery process if a node fails or is killed during bootstrap. A couple of examples should illustrate this.

Suppose we have nodes A and D in rack-unaware mode, with a replication factor of one (for simplicity). The ranges are then (D-A] for A and (A-D] for D. Nodes B and C then bootstrap between A and D, so we copy (A-B] to B and (A-C] to C. If both bootstraps complete successfully, they will serve (A-B] and (B-C]; that is, we transferred (A-B] to C unnecessarily. But if either bootstrap fails, the remaining bootstrap can ignore that and serve the entire range that was transferred to it.

So for rack-unaware bootstrapping it is clear that bootstrap-in-isolation is fine. But what about rack-aware?

Recall that in rack-aware mode, we write the first replica to the first node on the ring _in the other data center_, and the remaining replicas to nodes in the same data center. Say we have two nodes A and D, in different DCs, with a replication factor of 2:

A / D

    Node  Primary range  Replica for
    A     (D-A]          (A-D]
    D     (A-D]          (D-A]

If we add nodes B and C in the same DCs as A and D, respectively, we bootstrap as

A,B / C,D

B predicts the ring will be

    Node  Primary range  Replica for
    A     (D-A]          (B-D]
    B     (A-B]          (none)
    D     (B-D]          (D-A], (A-B]

C predicts

    Node  Primary range  Replica for
    A     (D-A]          (A-C], (C-D]
    C     (A-C]          (D-A]
    D     (C-D]          (none)

And really we end up with

    Node  Primary range  Replica for
    A     (D-A]          (B-C], (C-D]
    B     (A-B]          (none)
    C     (B-C]          (D-A], (A-B]
    D     (C-D]          (none)

So each node does have (a superset of) the right data copied. (Note that C has (A-B] as a replica in the final version, whereas it predicted it would be part of its primary range, but that doesn't matter as long as it ended up with the right data on it.)
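To double-check the three tables for the A,B / C,D case, here is a minimal Python sketch of the placement rule as described above. It assumes a replication factor of 2, so each range gets exactly one replica, placed on the first node clockwise in the other data center. The names `layout` and `show`, the `dc1`/`dc2` labels, and the ring-as-list representation are illustrative assumptions, not Cassandra's actual code.

```python
def layout(ring, dc):
    """Return {node: (primary_range, [ranges it holds as a replica])}.

    Assumption: RF=2, so each range gets one replica, on the first node
    clockwise from its primary that sits in the other data center.
    """
    primary = {}
    replicas = {node: [] for node in ring}
    for i, node in enumerate(ring):
        rng = f"({ring[i - 1]}-{node}]"  # ring[-1] wraps around for i == 0
        primary[node] = rng
        for k in range(1, len(ring)):
            candidate = ring[(i + k) % len(ring)]
            if dc[candidate] != dc[node]:
                replicas[candidate].append(rng)
                break
    return {node: (primary[node], replicas[node]) for node in ring}


def show(title, ring, dc):
    """Print one table in the Node / Primary range / Replica for layout."""
    print(title)
    for node, (prim, reps) in layout(ring, dc).items():
        print(f"  {node}  {prim}  {', '.join(reps) or '(none)'}")


dc = {"A": "dc1", "B": "dc1", "C": "dc2", "D": "dc2"}  # A,B / C,D
show("B predicts", ["A", "B", "D"], dc)   # B does not know about C
show("C predicts", ["A", "C", "D"], dc)   # C does not know about B
show("final ring", ["A", "B", "C", "D"], dc)
```

Each call is given only the nodes that the bootstrapping node knows about, which is how the sketch models bootstrap-in-isolation.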
If instead we add B and C both to D's datacenter we have:

A / B,C,D

B predicts

    Node  Primary range  Replica for
    A     (D-A]          (A-B], (B-D]
    B     (A-B]          (D-A]
    D     (B-D]          (none)

C predicts

    Node  Primary range  Replica for
    A     (D-A]          (A-C], (C-D]
    C     (A-C]          (D-A]
    D     (C-D]          (none)

And really we end up with

    Node  Primary range  Replica for
    A     (D-A]          (A-B], (B-C], (C-D]
    B     (A-B]          (D-A]
    C     (B-C]          (none)
    D     (C-D]          (none)

Again each node ends up with the right data.

Are there conditions under which we don't? After playing around with this in my mind I think that there are not, but this is tricky so peer review is welcome. :)
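As a starting point for that review, here is a rough Python sketch that brute-forces the four possible data center placements of B and C in this four-node example. For each placement it checks that what a joining node predicted it would store covers what it stores in the final ring. It assumes a replication factor of 2 and the simplified rule "one replica on the first node clockwise in the other data center". The names `stored` and `covers` and the segment labels are hypothetical. Passing this check for one small ring is not a proof for larger rings or higher replication factors.

```python
from itertools import product

FULL = ["A", "B", "C", "D"]  # ring order; label X stands for the segment ending at X


def stored(ring, dc):
    """Segments each node stores: its primary range plus ranges it replicates.

    Assumption: RF=2, one extra copy on the next node clockwise in the other DC.
    """
    out = {node: set() for node in ring}
    for i, node in enumerate(ring):
        # Primary range: every FULL segment after the previous ring member, up to node.
        start = FULL.index(ring[i - 1]) + 1
        segs = set()
        for step in range(len(FULL)):
            label = FULL[(start + step) % len(FULL)]
            segs.add(label)
            if label == node:
                break
        out[node] |= segs
        # Replica: first later node on this ring that is in the other data center.
        later = [ring[(i + k) % len(ring)] for k in range(1, len(ring))]
        replica = next((n for n in later if dc[n] != dc[node]), None)
        if replica is not None:
            out[replica] |= segs
    return out


def covers(dc):
    """True if each joiner's isolated prediction is a superset of its final data."""
    final = stored(FULL, dc)
    for joiner in ("B", "C"):
        view = [n for n in FULL if n in ("A", "D", joiner)]  # joiner ignores the other one
        predicted = stored(view, dc)[joiner]
        if not final[joiner] <= predicted:
            return False
    return True


results = {}
for dc_b, dc_c in product(("dc1", "dc2"), repeat=2):
    dc = {"A": "dc1", "B": dc_b, "C": dc_c, "D": "dc2"}
    results[(dc_b, dc_c)] = covers(dc)

for placement, ok in results.items():
    print(placement, "ok" if ok else "MISSING DATA")
```

Only the joining nodes are checked, since with two nodes and a replication factor of 2 both A and D already hold the whole ring. To probe further, `FULL`, the joiner list, and the `dc` map can be extended, but the placement rule would need the "remaining replicas" part added before trying a replication factor above 2.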