[jira] [Commented] (ZOOKEEPER-1549) Data inconsistency when follower is receiving a DIFF with a dirty snapshot

Flavio Junqueira (JIRA) Thu, 27 Dec 2012 09:10:13 -0800

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540043#comment-13540043
 ]


Flavio Junqueira commented on ZOOKEEPER-1549:
---------------------------------------------

[~thawan] Ok, I have a better idea of what you're referring to with the 
syncLimit comment, but I'm not fully there yet. syncLimit has always been a 
parameter that limits the amount of time a follower can take to catch up, so, 
just to be clear, I'm not proposing any change to the semantics of syncLimit.

About having it for 3.5.0, I suggest we make it a blocker for 3.5.0. If 
necessary, I also suggest we delay the release to have it in, although that is 
certainly not ideal.

Given that we will be creating a new branch (3.5), I suppose that we don't need 
to keep some of the backward-compatibility code that we currently have to make 
sure that 3.3 servers can talk to 3.4 servers, yes? Perhaps this is a question 
for [~phunt].

It would be awesome if you could work on the leader part. I think the only 
tricky part on the follower side is making sure that everything is persisted at 
the right time, but I don't think that we will need major code changes. If you 
can work on both, I'm happy to be the reviewer of your code, and otherwise I 
can work on the follower part and we review each other's patches. 
                
> Data inconsistency when follower is receiving a DIFF with a dirty snapshot
> --------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1549
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1549
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.4.3
>            Reporter: Jacky007
>            Priority: Blocker
>         Attachments: case.patch
>
>
> The trunc code (from ZOOKEEPER-1154?) cannot work correctly if the snapshot 
> is not correct.
> Here is the scenario (similar to 1154):
> Initial Condition
> 1.    Let's say there are three nodes in the ensemble A,B,C with A being the 
> leader
> 2.    The current epoch is 7. 
> 3.    For simplicity of the example, let's say zxid is a two digit number, 
> with epoch being the first digit.
> 4.    The zxid is 73
> 5.    All the nodes have seen the change 73 and have persistently logged it.
> Step 1
> Request with zxid 74 is issued. The leader A writes it to the log but there 
> is a crash of the entire ensemble and B,C never write the change 74 to their 
> log.
> Step 2
> A and B restart. A is elected as the new leader, loads its data, and takes a 
> clean snapshot (change 74 is in it). A then sends a diff to B, but B dies 
> before syncing with A. A dies later.
> Step 3
> B,C restart, A is still down
> B,C form the quorum
> B is the new leader. Let's say B's minCommitLog is 71 and maxCommitLog is 73
> epoch is now 8, zxid is 80
> Request with zxid 81 is successful. On B, minCommitLog is now 71, 
> maxCommitLog is 81
> Step 4
> A starts up. It applies the change in request with zxid 74 to its in-memory 
> data tree
> A contacts B to registerAsFollower and provides 74 as its ZxId
> Since 71<=74<=81, B decides to send A the diff. 
> Problem:
> The problem with the above sequence is that after truncating the log, A will 
> load the snapshot again, which is not correct (the snapshot still contains 
> change 74).
> In the 3.3 branch, FileTxnSnapLog.restore does not call the listener 
> (ZOOKEEPER-874), so the leader sends a snapshot to the follower and this is 
> not a problem.
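
The scenario in the description can be walked through with a small toy model. This is only a sketch under stated assumptions, not ZooKeeper code: every name below (epoch_of, restore, truncate, the variables) is hypothetical, state is modelled as the set of applied zxids, and zxid 80 is assumed to mark the start of epoch 8 rather than a logged transaction. It uses the two-digit zxid convention from the description.

```python
# Toy model of the scenario above; not ZooKeeper code. All names are hypothetical.

def epoch_of(zxid):
    """Two-digit zxid from the example: the first digit is the epoch."""
    return zxid // 10

def restore(snapshot, log):
    """Rebuild state as the set of applied zxids: snapshot contents plus logged txns."""
    return set(snapshot) | set(log)

def truncate(log, last_good_zxid):
    """Drop every logged transaction newer than last_good_zxid."""
    return [z for z in log if z <= last_good_zxid]

# Leader B after Step 3 (assumption: 80 only opens epoch 8 and is not logged).
leader_log = [71, 72, 73, 81]
min_committed, max_committed = leader_log[0], leader_log[-1]

# Follower A after Step 2: change 74 is in its log and in the snapshot it took.
a_log = [71, 72, 73, 74]
a_snapshot = {71, 72, 73, 74}
a_last_zxid = 74

# Range check from Step 4: 71 <= 74 <= 81, so A looks eligible for a DIFF.
in_range = min_committed <= a_last_zxid <= max_committed

# B never logged 74, so A's log is first cut back to the newest zxid B does have.
trunc_to = max(z for z in leader_log if z <= a_last_zxid)
a_log = truncate(a_log, trunc_to)

# Reloading from the snapshot brings 74 back even though the log is now clean.
a_state = restore(a_snapshot, a_log)

# Apply the DIFF: everything B has logged after the truncation point.
a_state |= {z for z in leader_log if z > trunc_to}

# Transactions A holds that the leader never saw.
stale = a_state - set(leader_log)
```

In this model, stale ends up containing only 74: truncating the log is not enough when the snapshot already includes the uncommitted change, which is the inconsistency the description reports.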

