[ https://issues.apache.org/jira/browse/ZOOKEEPER-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201875#comment-17201875 ]
maoling commented on ZOOKEEPER-2832:
------------------------------------

[~anaud] Thanks for digging into it. A good start. I will recheck what you said above.
The issue can no longer be reproduced on branch-3.5, branch-3.6, or master. Since you found that branch-3.4.9 had this issue, I will check whether the bug exists in 3.4.14.
branch-3.4.x is no longer maintained.
We really need to find the root cause of this issue and learn a lesson from it to avoid similar mistakes.

I guess you may also be interested in digging into ZOOKEEPER-3875, which still exists only in branch-3.5 :)

> Data Inconsistency occurs if follower has uncommitted transaction in the log
> while synchronizing with the leader that has the lower last processed zxid
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2832
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2832
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.9
>            Reporter: Beom Heyn Kim
>            Priority: Major
>             Fix For: 3.4.10
>
>         Attachments: zookeeper-2832.patch
>
>
> Synchronization code may fail to truncate an uncommitted transaction in the
> follower’s transaction log. Here is a scenario:
>
> Initial condition:
> Start the ensemble with three nodes A, B and C, with C being the leader
> The current epoch is 1
> For simplicity of the example, let’s say the zxid is a two-digit number, with
> the epoch being the first digit
> Create two znodes ‘key0’ and ‘key1’ whose values are ‘0’ and ‘1’, respectively
> The zxid is 12 -- 11 for creating key0 and 12 for creating key1. (For
> simplicity of the example, the zxid is increased only by transactions
> directly changing the data of znodes.)
> All the nodes have seen the change 12 and have persistently logged it
> Shut down all nodes
>
> Step 1
> Start Node A and B. Epoch becomes 2. Then, a request, setData(key0, 1000),
> with zxid 21 is issued.
> The leader B writes it to the log, but Node A is
> shut down before writing it to the log. Then, the leader B is also shut down.
> The change 21 is applied only to B, not to A or C.
>
> Step 2
> Start Node A and C. Epoch becomes 3. Node A has a higher zxid than Node C
> (i.e. 20 > 12). So, Node A becomes the leader. Yet, the last processed zxid
> is 12 for both Node A and C. So, they are in sync already. Node A sends an
> empty DIFF to Node C. Node C takes a snapshot and creates snapshot.12. Then,
> A and C are shut down. Now, C has a higher zxid than Node B.
>
> Step 3
> Start Node B and C. Epoch becomes 4. Node C has a higher zxid than Node B
> (i.e. 30 > 21). So, Node C becomes the leader. Node B and C have different
> last processed zxids (i.e. 21 vs 12), and the LinkedList object ‘proposals’ is
> empty. Thus, Node C sends SNAP to Node B. Node B takes a clean snapshot and
> creates snapshot.12, as the zxid 12 is the last processed zxid of the leader
> C. (Note that the newly created snapshot on B is assigned a lower zxid than the
> change 21 in the log.) Then, the request, setData(key1, 1001), with zxid 41
> is issued. Both B and C write the change 41 to their logs. (Note that B
> and C now have the same last processed zxid.) Then, B and C are shut down.
>
> Step 4
> Start Node B and C. Epoch becomes 5. Node B and C use their local log and
> snapshot files to restore their in-memory data trees. Node B has 1000 as the
> value of key0, because its latest valid snapshot is snapshot.12 and there
> is a later transaction with zxid 21 in its log. Yet, Node C has 0 as the
> value of key0, because the change 21 was never written on C. Node C is the
> leader. Node B and C have the same last processed zxid, i.e. 41. So, they are
> considered to be in sync already, and Node C sends an empty DIFF to Node B.
> So, the synchronization completes with the initially restored in-memory data
> trees on B and C.
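
The sync choices made in Steps 2 to 4 can be modelled as follows. This is a simplified sketch of the behaviour the report describes, not the actual LearnerHandler code: the class, enum and method names are hypothetical, and it assumes the leader's 'proposals' list is empty throughout, as in the scenario.

```java
// Toy model of the leader's sync choice as described in the scenario.
// All names are hypothetical; this is not ZooKeeper's LearnerHandler.
final class SyncChoiceSketch {

    enum Mode { DIFF, SNAP }

    /**
     * Assumes the leader's in-memory 'proposals' list is empty.
     * Equal last processed zxids are treated as "already in sync";
     * anything else falls back to a full snapshot. TRUNC is never chosen.
     */
    static Mode choose(long leaderLastZxid, long followerLastZxid) {
        if (followerLastZxid == leaderLastZxid) {
            return Mode.DIFF; // empty DIFF
        }
        return Mode.SNAP;     // follower's log is left untouched
    }

    public static void main(String[] args) {
        System.out.println("Step 2 (A leads C): " + choose(12, 12));
        System.out.println("Step 3 (C leads B): " + choose(12, 21));
        System.out.println("Step 4 (C leads B): " + choose(41, 41));
    }
}
```

In this model Step 3 yields SNAP even though the follower is ahead of the leader, and Step 4 yields DIFF because only the last processed zxids are compared.
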
>
> Problem
> The value of key0 on B is 1000, while the value of key0 on Node C is 0.
> LearnerHandler.run on C at Step 3 never sends TRUNC, only SNAP.
> So, the change 21 was never truncated on B. Also, at Step 4, since B uses the
> snapshot with the lower zxid to restore its in-memory data tree, the change 21
> could get into the data tree. Then, the leader C at Step 4 did not send
> SNAP, because the change 41 made to both B and C makes the leader C think that
> B and C are already in sync. Thus, data inconsistency occurs.
>
> The attached test case can deterministically reproduce the bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
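
To illustrate why the change 21 resurfaces on B, here is a toy model of the restore in Step 4. It assumes, as the report states, that a node loads its latest snapshot and then replays every logged transaction with a higher zxid. The class and helper names are hypothetical, and the create operations (11 and 12) are modelled as plain writes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of snapshot-plus-log restore; not ZooKeeper's actual code.
final class RestoreSketch {

    /** One logged transaction: write 'value' to 'key' at 'zxid'. */
    static final class Txn {
        final long zxid;
        final String key;
        final String value;

        Txn(long zxid, String key, String value) {
            this.zxid = zxid;
            this.key = key;
            this.value = value;
        }
    }

    /** Start from the snapshot, then replay log entries newer than it. */
    static Map<String, String> restore(long snapshotZxid,
                                       Map<String, String> snapshot,
                                       List<Txn> log) {
        Map<String, String> tree = new TreeMap<>(snapshot);
        for (Txn t : log) {
            if (t.zxid > snapshotZxid) {
                tree.put(t.key, t.value);
            }
        }
        return tree;
    }

    /** Contents of snapshot.12 on both B and C. */
    static Map<String, String> snapshot12() {
        Map<String, String> s = new TreeMap<>();
        s.put("key0", "0");
        s.put("key1", "1");
        return s;
    }

    /** B's log still contains the never-truncated change 21. */
    static List<Txn> logOfB() {
        return Arrays.asList(
            new Txn(11, "key0", "0"), new Txn(12, "key1", "1"),
            new Txn(21, "key0", "1000"), new Txn(41, "key1", "1001"));
    }

    /** C never saw the change 21. */
    static List<Txn> logOfC() {
        return Arrays.asList(
            new Txn(11, "key0", "0"), new Txn(12, "key1", "1"),
            new Txn(41, "key1", "1001"));
    }

    public static void main(String[] args) {
        System.out.println("B: " + restore(12, snapshot12(), logOfB()));
        System.out.println("C: " + restore(12, snapshot12(), logOfC()));
    }
}
```

Both nodes end with the same last zxid (41) and the same key1, but B restores key0 as 1000 while C restores it as 0, which is the divergence the report describes.
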