Bootstrap in rack-aware mode

Jonathan Ellis Mon, 30 Nov 2009 11:19:55 -0800

Do we have a problem with bootstrapping nodes not being aware of each
other under the rack-aware replication strategy?


Background: bootstrap makes the assumption that we can simplify things
by treating bootstrap of multiple nodes independently, trading some
(potential) extra copying for simplifying the process for recovery if
a node fails or is killed during the bootstrap process.

A couple examples should illustrate this.

Suppose we have nodes A and D in rack unaware mode, replication factor
of one (for simplicity).  The ranges are then (D-A] for A and (A-D]
for D.

Nodes B and C then bootstrap between A and D.  So we copy (A-B] to B
and (A-C] to C.  If both bootstraps complete successfully then they
will serve (A-B] and (B-C], that is, we transferred (A-B] to C
unnecessarily.  But, if either bootstrap fails, the remaining
bootstrap can ignore that and serve the entire range that was
transferred to it.
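
The rack-unaware example above can be sketched in a few lines of Python.
This is only an illustration, not Cassandra code: the helper name
`primary` and the convention of naming each elementary range by its right
endpoint (so "B" stands for (A-B]) are assumptions made for the sketch.

```python
def primary(ring, node, universe):
    """Elementary segments in `node`'s primary range on `ring`.

    A segment is named by its right endpoint, so "B" stands for (A-B]
    when the full token list is A, B, C, D.
    """
    ring, toks = sorted(ring), sorted(universe)
    pred = ring[ring.index(node) - 1]  # predecessor, wrapping around
    segs = set()
    i = toks.index(pred)
    while True:
        i = (i + 1) % len(toks)
        segs.add(toks[i])
        if toks[i] == node:
            return segs


existing, joining = ["A", "D"], ["B", "C"]
universe = existing + joining

# Each joining node plans against the existing ring plus itself only.
copied = {n: primary(existing + [n], n, universe) for n in joining}
# What each node actually serves once both bootstraps succeed.
served = {n: primary(universe, n, universe) for n in joining}
# Data that was transferred but turned out not to be needed.
extra = {n: copied[n] - served[n] for n in joining}
print(extra)
```

Here B has nothing extra, while C's extra set is {"B"}, i.e. the (A-B]
range that was copied to C unnecessarily.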

So for rack-unaware bootstrapping it is clear that
bootstrap-in-isolation is fine.  But what about rack-aware?

Recall that in rack-aware mode, we write the first replica to the
first node on the ring _in the other data center_, and the remaining
replicas to nodes in the same data center as the primary.

Say we have two nodes A and D, in different DCs, with a replication
factor of 2:

A / D

Node    Primary range    Replica for
A       (D-A]            (A-D]
D       (A-D]            (D-A]
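
As a rough sketch of the placement rule for the two-node ring above,
assuming a replication factor of 2 (so only the "other data center"
replica matters): the function name `replica` and the `dc` mapping are
made up for illustration and are not Cassandra's API.

```python
def replica(ring, dc, primary):
    """First node after `primary` on the ring in a different data center.

    Returns None if every node is in the same data center.
    """
    ring = sorted(ring)
    i = ring.index(primary)
    for step in range(1, len(ring)):
        cand = ring[(i + step) % len(ring)]  # walk clockwise, wrapping
        if dc[cand] != dc[primary]:
            return cand
    return None


dc = {"A": "dc1", "D": "dc2"}  # hypothetical data center labels
ring = sorted(dc)
replica_of = {node: replica(ring, dc, node) for node in ring}
print(replica_of)
```

A's primary range is replicated on D and D's on A, matching the table.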

If we add nodes B and C in the same DCs as A and D, respectively, we
bootstrap as

A,B / C,D

B predicts the ring will be
Node    Primary range    Replica for
A       (D-A]            (B-D]
B       (A-B]
D       (B-D]            (D-A], (A-B]

C predicts
Node    Primary range    Replica for
A       (D-A]            (A-C], (C-D]
C       (A-C]            (D-A]
D       (C-D]

And really we end up with
Node    Primary range    Replica for
A       (D-A]            (B-C], (C-D]
B       (A-B]
C       (B-C]            (D-A], (A-B]
D       (C-D]

So each node does have (a superset of) the right data copied.  (Note
that C has (A-B] as a replica in the final version, whereas it
predicted it would be part of its primary range, but that doesn't
matter as long as it ended up w/ the right data on it.)
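
The A,B / C,D case can be checked mechanically. The sketch below is a
simplified model, not Cassandra code: it assumes a replication factor of
2, names each elementary range by its right endpoint ("B" stands for
(A-B], "A" for (D-A]), and the helpers `owner`, `replica` and `held` are
hypothetical.

```python
def owner(ring, seg):
    """Primary owner of the segment ending at `seg` (ring must be sorted)."""
    for node in ring:
        if node >= seg:
            return node
    return ring[0]  # wrap around the ring


def replica(ring, dc, primary):
    """First node after `primary` in a different data center, or None."""
    i = ring.index(primary)
    for step in range(1, len(ring)):
        cand = ring[(i + step) % len(ring)]
        if dc[cand] != dc[primary]:
            return cand
    return None


def held(ring, dc, node, universe):
    """Segments `node` stores on `ring`, as primary or as the one replica."""
    ring = sorted(ring)
    out = set()
    for seg in sorted(universe):
        p = owner(ring, seg)
        if node in (p, replica(ring, dc, p)):
            out.add(seg)
    return out


dc = {"A": "dc1", "B": "dc1", "C": "dc2", "D": "dc2"}
existing, joining = ["A", "D"], ["B", "C"]
universe = existing + joining

# What each joining node copies, planning with only itself added.
predicted = {n: held(existing + [n], dc, n, universe) for n in joining}
# What each joining node must hold once both have joined.
final = {n: held(universe, dc, n, universe) for n in joining}

for n in joining:
    status = "ok" if predicted[n] >= final[n] else "MISSING DATA"
    print(n, "copies", sorted(predicted[n]), "needs", sorted(final[n]), status)
```

Both B and C come out "ok": what each copied covers what it must serve.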

If instead we add B and C both to D's datacenter we have:

A / B,C,D

B predicts
Node    Primary range    Replica for
A       (D-A]            (A-B], (B-D]
B       (A-B]            (D-A]
D       (B-D]

C predicts
Node    Primary range    Replica for
A       (D-A]            (A-C], (C-D]
C       (A-C]            (D-A]
D       (C-D]

And really we end up with
Node    Primary range    Replica for
A       (D-A]            (A-B], (B-C], (C-D]
B       (A-B]            (D-A]
C       (B-C]
D       (C-D]

Again each node ends up with the right data.

Are there conditions under which we don't?

After playing around with this in my mind I think that there are not,
but this is tricky so peer review is welcome. :)
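
One way to look for counterexamples is an exhaustive search over small
rings. The sketch below is a simplified model and not a proof: it assumes
two data centers, a replication factor of 2, integer tokens, at least one
existing node, and that each joining node plans against the existing ring
plus itself. All function names are hypothetical.

```python
from itertools import product


def owner(ring, seg):
    """Primary owner of the segment ending at `seg` (ring must be sorted)."""
    for node in ring:
        if node >= seg:
            return node
    return ring[0]  # wrap around the ring


def replica(ring, dc, primary):
    """First node after `primary` in a different data center, or None."""
    i = ring.index(primary)
    for step in range(1, len(ring)):
        cand = ring[(i + step) % len(ring)]
        if dc[cand] != dc[primary]:
            return cand
    return None


def held(ring, dc, node, universe):
    """Segments `node` stores on `ring`, as primary or as the one replica."""
    ring = sorted(ring)
    out = set()
    for seg in sorted(universe):
        p = owner(ring, seg)
        if node in (p, replica(ring, dc, p)):
            out.add(seg)
    return out


def counterexamples(max_nodes=5):
    """Cases where a joining node copies less than it must finally hold."""
    found = []
    for n in range(2, max_nodes + 1):
        tokens = list(range(n))
        for dcs in product((0, 1), repeat=n):
            dc = dict(zip(tokens, dcs))
            for flags in product((False, True), repeat=n):
                existing = [t for t, j in zip(tokens, flags) if not j]
                joining = [t for t, j in zip(tokens, flags) if j]
                if not existing or not joining:
                    continue
                for node in joining:
                    predicted = held(existing + [node], dc, node, tokens)
                    needed = held(tokens, dc, node, tokens)
                    if not predicted >= needed:
                        found.append((dc, existing, node))
    return found


found = counterexamples()
print(len(found), "counterexamples")
```

An empty result would be consistent with the conclusion above, but only
for the cases this model covers: it says nothing about higher
replication factors, more data centers, or nodes leaving the ring.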
