[ https://issues.apache.org/jira/browse/ZOOKEEPER-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087857#comment-13087857 ]
Vishal Kathuria commented on ZOOKEEPER-1154:
--------------------------------------------

Hi Flavio,

Looking forward to reading your paper. I looked online and found the slide deck but couldn't find a link to the PDF. Could you please forward the link to me?

I agree that our not testing all the scenarios is a serious problem: it lets regressions slip through and go uncaught for years (as happened with ZOOKEEPER-1156, which has been in the code for 2 years).

I fully understand your concern around simplicity. When I submit the patch, please let me know what you think. I found a way to do this without maintaining additional state/variables. The idea is that if the follower has a zxid for which the leader has no matching committed proposal, the leader asks the follower to truncate to the latest zxid for which the leader does have a committed proposal, and then starts sending diffs from that point.

My new test is passing with the fix. Interestingly, I am seeing JVM crashes in the hammer tests, so I need to investigate that before I can submit the patch.

Thanks!

> Data inconsistency when the node(s) with the highest zxid is not present at the time of leader election
> -------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1154
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1154
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kathuria
>            Priority: Blocker
>             Fix For: 3.4.0
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> If a participant with the highest zxid (let's call it A) isn't present during
> leader election, a participant with a lower zxid (say B) might be chosen as
> the leader. When A comes up, it will replay its log, including that higher
> zxid. The change made in that higher zxid will be visible only to clients
> connecting to participant A, not to those connecting to other participants.
> I was able to reproduce this problem as follows:
> 1. Connect a debugger to B and C and suspend them, so they don't write anything.
> 2. Issue an update to the leader A.
> 3. After a few seconds, crash all servers (A,B,C).
> 4. Start B and C and let the leader election take place.
> 5. Start A.
> 6. You will find that the update done in step 2 is visible on A but not on
> B,C, hence the inconsistency.
>
> Below is a more detailed analysis of what is happening in the code.
>
> Initial Condition
> 1. Let's say there are three nodes in the ensemble, A,B,C, with A being the
> leader.
> 2. The current epoch is 7.
> 3. For simplicity of the example, let's say a zxid is a two-digit number,
> with the epoch being the first digit.
> 4. The zxid is 73.
> 5. All the nodes have seen the change 73 and have persistently logged it.
>
> Step 1
> A request with zxid 74 is issued. The leader A writes it to its log, but the
> entire ensemble crashes and B,C never write the change 74 to their logs.
>
> Step 3
> B,C restart; A is still down.
> B,C form the quorum.
> B is the new leader. Let's say B's minCommitLog is 71 and maxCommitLog is 73.
> The epoch is now 8 and the zxid is 80.
> A request with zxid 81 is successful. On B, minCommitLog is now 71 and
> maxCommitLog is 81.
>
> Step 4
> A starts up. It applies the change in the request with zxid 74 to its
> in-memory data tree.
> A contacts B to registerAsFollower and provides 74 as its ZxId.
> Since 71<=74<=81, B decides to send A the diff. B will send A the proposal
> 81.
>
> Problem:
> The problem with the above sequence is that A's data tree has the update from
> request 74, which is not correct. Before getting the proposal 81, A should
> have received a trunc to 73. I don't see that in the code. If the
> maxCommitLog on B hadn't bumped to 81 but had stayed at 73, that case seems
> to be fine.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
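
The sync decision from Step 4 of the quoted analysis, and the truncate-then-diff idea from the comment, can be sketched as below. This is only an illustration under stated assumptions, not ZooKeeper's code and not the patch itself: the class `SyncSketch`, the `Plan` type, and both methods are hypothetical names, the leader's committed log is modeled as a plain ascending list of zxids, and the SNAP fallback for a follower older than the leader's log is assumed. The zxids use the two-digit form from the example (epoch first).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the leader-side sync decision; not ZooKeeper's actual code.
final class SyncSketch {
    enum Kind { DIFF, TRUNC, SNAP }

    static final class Plan {
        final Kind kind;
        final long truncateTo;    // only meaningful when kind == TRUNC
        final List<Long> toSend;  // committed zxids sent to the follower afterwards

        Plan(Kind kind, long truncateTo, List<Long> toSend) {
            this.kind = kind;
            this.truncateTo = truncateTo;
            this.toSend = toSend;
        }

        @Override
        public String toString() {
            String trunc = (kind == Kind.TRUNC) ? " to " + truncateTo : "";
            return kind + trunc + ", then send " + toSend;
        }
    }

    // Committed zxids strictly greater than the given zxid.
    static List<Long> newerThan(List<Long> committed, long zxid) {
        List<Long> newer = new ArrayList<>();
        for (long z : committed) {
            if (z > zxid) {
                newer.add(z);
            }
        }
        return newer;
    }

    // Range-only check, as described in Step 4: any zxid inside [min, max] gets a DIFF.
    static Plan rangeOnly(List<Long> committed, long followerZxid) {
        if (committed.isEmpty() || followerZxid < committed.get(0)) {
            return new Plan(Kind.SNAP, -1, new ArrayList<>()); // assumed fallback
        }
        long max = committed.get(committed.size() - 1);
        if (followerZxid > max) {
            return new Plan(Kind.TRUNC, max, new ArrayList<>());
        }
        return new Plan(Kind.DIFF, -1, newerThan(committed, followerZxid));
    }

    // Idea from the comment: if the leader has no committed proposal matching the
    // follower's zxid, truncate the follower to the latest committed zxid below it,
    // then send the newer proposals.
    static Plan withTrunc(List<Long> committed, long followerZxid) {
        if (committed.isEmpty() || followerZxid < committed.get(0)) {
            return new Plan(Kind.SNAP, -1, new ArrayList<>()); // assumed fallback
        }
        long base = committed.get(0);
        boolean known = false;
        for (long z : committed) {
            if (z <= followerZxid) {
                base = z;
                known = known || (z == followerZxid);
            }
        }
        List<Long> newer = newerThan(committed, followerZxid);
        return known ? new Plan(Kind.DIFF, -1, newer) : new Plan(Kind.TRUNC, base, newer);
    }

    public static void main(String[] args) {
        // B's committed log after Step 3; A reports 74 in Step 4.
        List<Long> committedOnB = Arrays.asList(71L, 72L, 73L, 81L);
        System.out.println("range-only: " + rangeOnly(committedOnB, 74L));
        System.out.println("with trunc: " + withTrunc(committedOnB, 74L));
    }
}
```

With B's log from the example and A reporting 74, the range-only check yields a plain DIFF carrying 81, so A keeps the uncommitted change 74. The second method yields a TRUNC to 73 followed by 81, which is the behavior the issue asks for. When A's zxid lies above B's maxCommitLog, both methods truncate A to that maximum, matching the case the description calls fine.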