anaud created ZOOKEEPER-3972:
--------------------------------

             Summary: Convergence fail when a follower tries to resync with a 
leader having incomplete commitlog
                 Key: ZOOKEEPER-3972
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3972
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.5.8
            Reporter: anaud
         Attachments: 
zookeeper-testResyncWithLeaderHavingIncompleteCommitlog.patch

A leader may have an incomplete commitlog because it resynced with the old 
leader via SNAPSHOT replication.

A follower may then try to resync with this leader even though the follower 
missed some earlier transactions that the leader does not have in its 
commitlog.

In that case, the two nodes resync using txnlog + commitlog. However, this 
leads to convergence failure because the leader does not send the missing 
transactions that are not in its commitlog.
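
To make the gap concrete, here is a minimal sketch in Python. It is a toy 
model, not ZooKeeper code: the function names, the use of plain integers as 
transaction ids, and the sample numbers are all assumptions made for 
illustration. It compares what a log-only resync would send with what the 
follower actually lacks.

```python
def log_based_resync(leader_log, follower_last_zxid):
    """Transactions a log-only resync sends: logged entries newer than the follower."""
    return sorted(z for z in leader_log if z > follower_last_zxid)


def resync_gap(leader_applied, leader_log, follower_last_zxid):
    """Transactions the leader has applied that a log-only resync fails to deliver."""
    needed = {z for z in leader_applied if z > follower_last_zxid}
    return sorted(needed - set(log_based_resync(leader_log, follower_last_zxid)))


# Hypothetical numbers: the leader has applied 1..5, but 3 and 4 arrived inside
# a snapshot and were never recorded in its log.
leader_applied = {1, 2, 3, 4, 5}
leader_log = {1, 2, 5}

print(log_based_resync(leader_log, 2))            # [5]
print(resync_gap(leader_applied, leader_log, 2))  # [3, 4]
```

A non-empty gap means the follower ends up behind the leader's state after a 
resync that looked successful.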

Here are the abstract steps to reproduce the bug. I attached a patch with a 
test case that reproduces it.

Initially, nodes A, B, and C are all in sync.
1. Node A crashes; setData 0x11 on B and C
2. Node B and C crash
3. Node A and B restart
4. Node A crashes; setData 0x21 on B
5. Node B crashes
6. Node B and C restart
7. Node C crashes; setData 0x32 on B
8. Node A and C restart
9. Node B restarts


At step 6, C is a follower getting a snapshot from B, and C does not have the 
transaction 0x21 in its commitlog (only in the snapshot).

At step 8, C is the leader and does not have 0x21 in its commitlog, so A never 
receives it.

In the end, 0x21 exists only on B and C, not on A.
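
The steps above can be replayed in a small toy model, sketched here in Python. 
This is not the attached test case and not ZooKeeper code. The Node class and 
both sync helpers are hypothetical, the hexadecimal labels are treated as 
transaction ids, and the resyncs at steps 3 and 8 are assumed to be log-based. 
Step 9 (B restarting) is not modelled.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.applied = set()  # transactions reflected in the node's data
        self.log = set()      # transactions also recorded in the node's log

    def write(self, zxid):
        self.applied.add(zxid)
        self.log.add(zxid)

    def last(self):
        return max(self.applied, default=0)


def snapshot_sync(leader, follower):
    """Follower adopts the leader's data; its log gains nothing."""
    follower.applied = set(leader.applied)


def log_sync(leader, follower):
    """Follower receives only the leader's logged transactions newer than its own."""
    for z in sorted(leader.log):
        if z > follower.last():
            follower.write(z)


a, b, c = Node("A"), Node("B"), Node("C")
for n in (b, c):
    n.write(0x11)      # step 1: A is down, 0x11 lands on B and C
log_sync(b, a)         # step 3: A catches up from B (assumed log-based)
b.write(0x21)          # step 4: A is down again, 0x21 lands on B
snapshot_sync(b, c)    # step 6: C takes a snapshot from B
b.write(0x32)          # step 7: C is down, 0x32 lands on B
log_sync(c, a)         # step 8: C leads, A resyncs from C's log

holders = [n.name for n in (a, b, c) if 0x21 in n.applied]
print(holders)  # ['B', 'C']
```

In this model C holds 0x21 only in its data, so the log-based resync at step 8 
has nothing to send to A.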

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
