Re: question on ZAB protocol

Flavio Junqueira Wed, 13 Jul 2011 05:51:50 -0700

Hi Yang, Some answers between lines:

On Jul 13, 2011, at 3:45 AM, Yang wrote:

the leader broadcasts a write to all replicas, and then waits for a
quorum to reply, before sending out the COMMIT.
why is the quorum necessary (i.e. why can't the leader just wait for
one reply and start sending the COMMIT?)??

now that I think about it, it seems that waiting for just one reply is
enough, because the connection from leader to replicas are FIFO, as
long as the replicas do not die,
they will eventually get the writes, even though the writes arrive at
them after the leader starts the COMMIT.

Just to clarify, FIFO guarantees that a prefix of what has been sentis received and that the messages follow the send order. Delivery isnot guaranteed, though. If we deliver no message, FIFO is stillpreserved.

the only reason I can think of  for using a quorum is to tolerate more
failures: if the only replied replica  dies, and leader dies, then we
lose that  latest write.

Committing a transaction based on a single acknowledgement does notguarantee the durability of the operation in general if you want totolerate at least one crash. Once a client receives a confirmation ofan operation, enough ZooKeeper replicas must have persisted thecorresponding update. Enough in this case implies that any subset ofreplicas used to recover upon a leader crash must have at least oneserver containing that update.

Now, I think that your optimization works only in the special casethat the client that issued the operation is connected to a followerand we have 3 replicas, assuming majority quorums. If the client isconnected to the leader directly, then it doesn't work. It alsodoesn't work if we need to tolerate at least 2 crashes, independent ofthe status of the replica the client is connected to.

by requiring f ACKs, you can tolerate f-1 failures. but then you don't
really need 2f+1 nodes in the ZK cluster, just f+1 is enough.

This derivation might be obvious, but let me put it down forcompleteness at least. Say that we use at least q servers to recoverupon a leader crash. To ensure progress, we need at least "n >= q + f"replicas, where f is a threshold on the number of simultaneous crashesyou're willing to tolerate. If "q <= f", then we can't guarantee thattransactions committed during a recovery step are durable, sinceanother subsequent recovery step might not have any replica in commonwith the previous one. Consequently, we have that q is at least f+1,which gives us "n >= 2f + 1"


Does it make sense?

-Flavio

Re: question on ZAB protocol

Reply via email to