Thanks Alex for the detailed explanation, I understand much better. Following the your explanation, another question hits me. Say, that, only one follower A persist the write to the disk and acks the proposal to the leader,but others don't.
The all the quorum are restarted. There are two things may happen 1. B,C,D starts first, assume D is the leader. Then A's last write will be lost because A will sync with D, I think this is OK because, the client of A didn't get the response that A's write is successful. 2. A,B,C starts first Assume A is the leader since it has more recent transaction id, then the whole quorum will have this write because B,C will sync with A. At last, the whole quorum will have the write. Is this the expected behavior? I don't think so because 1 and 2 are conflicting. In 1, A's write is inaccessible,but in 2, A's write is accessible. Is there something that I miss? Thanks. [email protected] From: Alexander Shraer Date: 2015-01-06 13:26 To: [email protected] Subject: Re: Question about the two-phrase commit Hi, A few things are not accurate. First, ZooKeeper implements consensus on each operation, not 2 phase commit. There are differences in the definition and guaratees of 2PC and Consensus. > 2. Followers ack the proposal and writes the change to the disk(but not persisted yet?) Before acking a follower writes/persists the proposed operation to disk (an operations log). > 4. When each follower receives the commit request, follower commits the changes(persist the change for ever?) The commit operaiton does not trigger a write to disk. What it does is that now the state change is applied to an in memory data structure holding ZooKeeper state. Since reads are served from that in-memory data structure, the write is now visible to reads. > d. Assume that When the response from A is back to client telling the client that the write is successful, But in the period, the > other followers (B,C,D) haven't even received the commit request, and B,C,D are down without getting a chance to commit the > change. Whether an operation was "committed" or not is not important during recovery. What's important is that a quorum acked it. So when you restart B, C, D, one of them necessarily has this write in its log. The others may have it or not but in any case, who ever is elected leader will have this operation in the log. When a leader is established the log is applied to memory (there are also snapshots which allow truncating the log, so the snapshot is applied first and then the log). When A syncs with the new leader, the only thing that it can loose are operations that were not acked by a quorum previously, and hence was not committed. Alex On Sun, Jan 4, 2015 at 11:05 PM, [email protected] <[email protected]> wrote: > > In the above process, something rare could happen > a. Say,there are 5 nodes in the quorum(1 leader E, 4 follower A,B,C,D). > b. The write operation is issued by the client that connects to Follower A > c. A commits the changes and response to the client that the writer > succeeds. > d. Assume that When the response from A is back to client telling the > client that the write is successful, But in the period, the other followers H
