[ https://issues.apache.org/jira/browse/ZOOKEEPER-2845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347026#comment-16347026 ]

Robert Joseph Evans commented on ZOOKEEPER-2845:
------------------------------------------------

I have a fix that I will be posting shortly.  I need to clean up the patch and 
make sure that I get pull requests ready for all of the branches that 
ZOOKEEPER-2926 went into.

 

The following table describes the situation that allows a node to get into an 
inconsistent state.

 
|| ||N1||N2||N3||
|Start with cluster in sync N1 is leader|0x0 0x5|0x0 0x5|0x0 0x5|
|N2 and N3 go down|0x0 0x5| | |
|Proposal to N1 (fails with no quorum)|0x0 0x6| | |
|N2 and N3 return, but N1 is restarting.  N2 elected leader| |0x1 0x0|0x1 0x0|
|A proposal is accepted| |0x1 0x1|0x1 0x1|
|N1 returns and is trying to sync with the new leader N2|0x0 0x6|0x1 0x1|0x1 0x1|

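For reference, the pairs in the table are (epoch, counter); a zxid is a single 64-bit value with the epoch in the high 32 bits and the counter in the low 32 bits. Below is a minimal sketch of that packing, using plain bit arithmetic rather than the project's own helper class (the class and method names are only for illustration):

{code:java}
// Minimal illustration of the zxid layout used in the table above:
// epoch in the high 32 bits, counter in the low 32 bits.
public final class ZxidSketch {

    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    public static void main(String[] args) {
        long n1 = makeZxid(0x0, 0x6); // N1 after the proposal that never got a quorum
        long n2 = makeZxid(0x1, 0x1); // N2/N3 after the new epoch and one commit
        // N1's zxid is numerically smaller, but it belongs to an older epoch.
        System.out.printf("N1=0x%x (epoch 0x%x), N2=0x%x (epoch 0x%x)%n",
                n1, epochOf(n1), n2, epochOf(n2));
    }
}
{code}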
 

At this point the code in {{LearnerHandler.syncFollower}} takes over to bring 
N1 into sync with N2, the new leader.

That code checks the following, in order (a simplified sketch follows the list):
 # Is there a {{forceSync}}? Not in this case.
 # Are the two zxids already in sync? No, {{0x0 0x6 != 0x1 0x1}}.
 # Is the peer zxid > the local zxid (and the peer didn't just rotate to a new 
epoch)? No, {{0x0 0x6 < 0x1 0x1}}.
 # Is the peer zxid between the min committed log and the max committed log? 
In this case yes, but it shouldn't be treated that way. The max committed log 
is {{0x1 0x1}}. The min committed log is {{0x0 0x5}}, or likely something 
below it, because it is based on distance back in the edit log. The issue is 
that once the epoch changes from {{0x0}} to {{0x1}}, the leader has no idea 
whether the peer's edits are in its edit log without explicitly checking for 
them.
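
Here is a simplified sketch of that decision order as described above; it is not the actual {{LearnerHandler.syncFollower}} code, and the parameter names are stand-ins for the state the real method consults:

{code:java}
// Simplified sketch of the sync decision order described above (NOT the real
// LearnerHandler.syncFollower). peerZxid is the learner's last zxid; the other
// parameters stand in for the leader's in-memory committed-log state.
final class SyncDecisionSketch {

    static boolean shouldSendDiff(long peerZxid, long leaderLastZxid,
                                  long minCommittedLog, long maxCommittedLog,
                                  boolean forceSync) {
        if (forceSync) {
            return false;                    // 1. forced snapshot sync
        }
        if (peerZxid == leaderLastZxid) {
            return true;                     // 2. already in sync: empty DIFF
        }
        if (peerZxid > leaderLastZxid) {
            return false;                    // 3. peer is ahead (handled separately)
        }
        // 4. The problematic branch for this bug: 0x6 (epoch 0x0, counter 0x6)
        //    falls between 0x5 and 0x100000001, so a DIFF is chosen even though
        //    the leader never saw the peer's epoch-0x0 proposal.
        return minCommittedLog <= peerZxid && peerZxid <= maxCommittedLog;
    }
}
{code}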

 

The reason that ZOOKEEPER-2926 exposed this is that previously, when a leader 
was elected, the in-memory DB was dropped and everything was reread from disk.  
When that happened, the {{0x0 0x6}} proposal was lost.  But it is not 
guaranteed to be lost in all cases.  In theory a snapshot could be triggered 
by that proposal, either on the leader, or on a follower that also applied the 
proposal but does not join the new quorum in time.  As such, ZOOKEEPER-2926 
really just extended the window of an already existing race.  But it extended 
it almost indefinitely, so the race is much more likely to happen.

 

My fix is to update {{LearnerHandler.syncFollower}} to only send a {{DIFF}} if 
the epochs are the same.  If they are not the same, we don't know whether the 
peer has a transaction in its log that we don't know about.
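
In terms of the sketch above, the epoch guard would look roughly like the following. This is only my reading of the fix as described, not the actual patch, and the names are again stand-ins:

{code:java}
// Hypothetical epoch guard for the committed-log range check (not the actual
// patch): only consider a DIFF when the peer's zxid comes from the same epoch
// as the leader's max committed zxid.
final class EpochGuardSketch {

    static boolean canSendDiff(long peerZxid, long minCommittedLog, long maxCommittedLog) {
        long peerEpoch = peerZxid >> 32;
        long leaderEpoch = maxCommittedLog >> 32;
        if (peerEpoch != leaderEpoch) {
            // Different epochs: the leader cannot tell whether the peer's last
            // transaction is in its log, so fall back to a full SNAP sync.
            return false;
        }
        return minCommittedLog <= peerZxid && peerZxid <= maxCommittedLog;
    }
}
{code}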

 

> Data inconsistency issue due to retain database in leader election
> ------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2845
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2845
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>
> In ZOOKEEPER-2678, the ZKDatabase is retained to reduce the unavailable time 
> during leader election. In a ZooKeeper ensemble, it's possible that the 
> snapshot is ahead of the txn file (due to a slow disk on the server, etc.), 
> or the txn file is ahead of the snapshot because no commit message has been 
> received yet. 
> If the snapshot is ahead of the txn file, then since the SyncRequestProcessor 
> queue will be drained during shutdown, the snapshot and txn file will stay 
> consistent before leader election happens, so this is not an issue.
> But if the txn file is ahead of the snapshot, it's possible that the ensemble 
> will have a data inconsistency issue. Here is a simplified scenario to show 
> the issue:
> Let's say we have 3 servers in the ensemble, servers A and B are followers, 
> and C is the leader, and all the snapshots and txns are up to T0:
> 1. A new request reaches leader C to create node N, and it's converted to 
> txn T1.
> 2. Txn T1 is synced to disk on C, but just before the proposal reaches the 
> followers, A and B restart, so T1 doesn't exist on A and B.
> 3. A and B form a new quorum after restarting; let's say B is the leader.
> 4. C changes to the looking state because it has too few followers; it will 
> sync with leader B with last zxid T0, which will be an empty diff sync.
> 5. Before C takes a snapshot it restarts; it replays the txns on disk, which 
> include T1, so it now has node N, but A and B don't have it.
> Also, I included a test case to reproduce this issue consistently. 
> We have a totally different RetainDB version which avoids this issue by 
> doing consensus between the snapshot and txn files before leader election; 
> we will submit it for review.



