[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201875#comment-17201875
 ] 

maoling commented on ZOOKEEPER-2832:
------------------------------------

[~anaud] 

Thanks for digging into it. A good start. I will recheck what you said above. 
The issue can no longer be reproduced on branch-3.5, branch-3.6 or master. 
Since you found that 3.4.9 had this issue, I will check whether the bug still 
exists in 3.4.14. Note that branch-3.4.x is no longer maintained.

We really need to find the root cause of this issue and learn a lesson from it 
to avoid similar mistakes.

I guess you may also be interested in digging into ZOOKEEPER-3875, which still 
exists only in branch-3.5 :)

> Data Inconsistency occurs if follower has uncommitted transaction in the log 
> while synchronizing with the leader that has the lower last processed zxid
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2832
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2832
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.9
>            Reporter: Beom Heyn Kim
>            Priority: Major
>             Fix For: 3.4.10
>
>         Attachments: zookeeper-2832.patch
>
>
> Synchronization code may fail to truncate an uncommitted transaction in the 
> follower’s transaction log. Here is a scenario:
>  
> Initial condition:
> Start the ensemble with three nodes A, B and C with C being the leader
> The current epoch is 1
> For simplicity of the example, let’s say zxid is a two digit number, with 
> epoch being the first digit
> Create two znodes ‘key0’ and ‘key1’ whose value is ‘0’ and ‘1’, respectively
> The zxid is 12 -- 11 for creating key0 and 12 for creating key1. (For 
> simplicity of the example, the zxid gets increased only by transactions 
> directly changing the data of znodes.)
> All the nodes have seen the change 12 and have persistently logged it
> Shut down all
>  
> Step 1
> Start Node A and B. Epoch becomes 2. Then, a request, setData(key0, 1000), 
> with zxid 21 is issued. The leader B writes it to the log, but Node A is 
> shut down before writing it to the log. Then, the leader B is also shut down. 
> The change 21 is applied only to B but not to A or C.
>  
> Step 2
> Start Node A and C. Epoch becomes 3. Node A has a higher zxid than Node C 
> (i.e. 20 > 12). So, Node A becomes the leader. Yet, the last processed zxid 
> is 12 for both Node A and C. So, they are in sync already. Node A sends an 
> empty DIFF to Node C. Node C takes a snapshot and creates snapshot.12. Then, 
> A and C are shut down. Now, C has a higher zxid than Node B.
>  
> Step 3
> Start Node B and C. Epoch becomes 4. Node C has a higher zxid than Node B 
> (i.e. 30 > 21). So, Node C becomes the leader. Node B and C have different 
> last processed zxids (i.e. 21 vs 12), and the LinkedList object ‘proposals’ is 
> empty. Thus, Node C sends SNAP to Node B. Node B takes a clean snapshot and 
> creates snapshot.12, as the zxid 12 is the last processed zxid of the leader 
> C. (Note that the newly created snapshot on B is assigned a lower zxid than 
> the change 21 in the log.) Then, the request, setData(key1, 1001), with zxid 
> 41 is issued. Both B and C apply the change 41 into their logs. (Note that 
> now B and C have the same last processed zxid.) Then, B and C are shut down.
>  
> Step 4
> Start Node B and C. Epoch becomes 5. Node B and C use their local log and 
> snapshot files to restore their in-memory data tree. Node B has 1000 as the 
> value of key0, because its latest valid snapshot is snapshot.12 and there 
> was a later transaction with zxid 21 in its log. Yet, Node C has 0 as the 
> value of key0, because the change 21 was never written on C. Node C is the 
> leader. Node B and C have the same last processed zxid, i.e. 41. So, they are 
> considered to be in sync already, and Node C sends an empty DIFF to Node B. 
> So, the synchronization completes with the initially restored in-memory data 
> tree on B and C.
>  
> Problem
> The value of key0 on B is 1000, while the value of key0 on Node C is 0. 
> LearnerHandler.run on C at Step 3 never sends TRUNC, only SNAP. 
> So, the change 21 was never truncated on B. Also, at Step 4, since B uses the 
> snapshot with the lower zxid to restore its in-memory data tree, the change 21 
> could get into the data tree. Then, the leader C at Step 4 did not send 
> SNAP, because the change 41 made to both B and C makes the leader C think 
> B and C are already in sync. Thus, data inconsistency occurs.
>  
> The attached test case can deterministically reproduce the bug.
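
The restore at Step 4 of the quoted description can be illustrated with a 
small toy model. This is only a sketch under stated assumptions, not ZooKeeper 
code and not the attached test case: the Node class, its restore method and 
the sends_empty_diff flag are hypothetical names, the zxids are the simplified 
two-digit values from the description, and the leader's decision is reduced to 
a single equality check on the last processed zxid.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy on-disk state of one server (hypothetical, not a ZooKeeper class)."""
    snapshot_zxid: int
    snapshot_data: dict
    txn_log: list = field(default_factory=list)  # entries: (zxid, key, value)

    def restore(self):
        """Load the snapshot, then replay every logged txn newer than it."""
        tree = dict(self.snapshot_data)
        last_zxid = self.snapshot_zxid
        for zxid, key, value in sorted(self.txn_log):
            if zxid > self.snapshot_zxid:
                tree[key] = value
                last_zxid = zxid
        return tree, last_zxid


# State on disk just before Step 4 (log entries 11 and 12 are omitted because
# snapshot.12 already covers them).
snapshot_12 = {"key0": "0", "key1": "1"}
# B still holds the never-truncated change 21 next to the committed change 41.
node_b = Node(12, snapshot_12, [(21, "key0", "1000"), (41, "key1", "1001")])
# C never logged change 21.
node_c = Node(12, snapshot_12, [(41, "key1", "1001")])

tree_b, last_b = node_b.restore()
tree_c, last_c = node_c.restore()

# Assumed simplification of the leader's check: an equal last processed zxid
# is treated as "already in sync", so only an empty DIFF would be sent.
sends_empty_diff = last_b == last_c

print(sends_empty_diff, tree_b["key0"], tree_c["key0"])  # True 1000 0
```

In this model both nodes report 41 as their last processed zxid, yet key0 is 
1000 on B and 0 on C, which matches the inconsistency described under Problem.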



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
