[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808004#comment-13808004 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

I have to think about this one, but from the log excerpt that Raul posted, one problem I can see is that n.round has different values, so vote.equals would return false in termPredicate when comparing the votes, no?

ZooKeeper server unable to join established ensemble

Key: ZOOKEEPER-1732
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1732
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection
Affects Versions: 3.4.5
Environment: Windows 7, Java 1.7
Reporter: Germán Blanco
Assignee: Germán Blanco
Priority: Blocker
Fix For: 3.4.6, 3.5.0
Attachments: CREATE_INCONSISTENCIES_patch.txt, zklog.tar.gz, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-3.4.patch, ZOOKEEPER-1732-b3.4.patch, ZOOKEEPER-1732-b3.4.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch, ZOOKEEPER-1732.patch

I have a test in which I do a rolling restart of three ZooKeeper servers and it was failing from time to time. I ran the tests in a loop until the failure came out and it seems that at some point one of the servers is unable to join the ensemble formed by the other two.

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808081#comment-13808081 ]

Germán Blanco commented on ZOOKEEPER-1732:

Yes, that is what I mean. The round value in the votes is updated in updateElectionVote() after the election is finished.

In the previous code (without the patch), the vote when the election was finished had the epoch of the leader, that is, the epoch that the new leader had when the election started. In the code after the patch, the vote is updated in updateElectionVote() to the epoch that the leader is using after the election is finished, which is one more than the epoch that it was using when the election started.

I think that if newEpoch-1 is used to update the election vote, then things should be ok. If that is done, then servers with and without the patch should have the same value of epoch in the vote after the election is finished.

It is very good that [~rgs] has spotted this so soon, since it would have been seen in all upgrades from 3.4.5 to 3.4.6. On the other hand, the consequences are not too serious. It only happens when servers with different versions are running in the same quorum, and it only happens if there is an ensemble running (so there should be no interruption of the service).
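To make the epoch arithmetic above concrete, here is a minimal sketch. It assumes, purely for illustration, that the leader's epoch was 0x88 when the election started (so newEpoch is 0x89, matching the two peer epoch values in Raul's log). The class and method names are hypothetical and are not part of ZooKeeper.

```java
// Illustrative sketch only; these names are hypothetical, not ZooKeeper's API.
final class VoteEpochSketch {
    /** Unpatched behaviour as described: the vote keeps the epoch the leader had when the election started. */
    static long unpatchedVoteEpoch(long epochAtElectionStart) {
        return epochAtElectionStart;
    }

    /** Patched behaviour as described: the vote is rewritten with the epoch in use after the election. */
    static long patchedVoteEpoch(long epochAtElectionStart) {
        long newEpoch = epochAtElectionStart + 1;
        return newEpoch;
    }

    /** Proposed adjustment: rewrite the vote with newEpoch - 1 instead. */
    static long proposedVoteEpoch(long epochAtElectionStart) {
        long newEpoch = epochAtElectionStart + 1;
        return newEpoch - 1;
    }
}
```

With a starting epoch of 0x88, the unpatched and proposed variants both yield 0x88, while the patched variant yields 0x89. That one-off difference is what would keep servers on different versions from agreeing on the vote.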
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808194#comment-13808194 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1732:

Hmm, I still think this could confuse people rolling a cluster. Sounds like we should revert this for the next release unless we have a fix for it. Smooth upgrades through rolling restarts are an expectation that ZooKeeper has always maintained.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807102#comment-13807102 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1732:

[~fpj], [~abranzyck]: did you guys test this patch when joining a cluster of servers running without this patch (i.e., trunk, only without this patch)? After rolling the first 2 followers in a 5-member ensemble, the 3rd follower fails to join with this:

{noformat}
2013-10-28 18:43:18,134 - INFO [WorkerReceiver[myid=4]] - Notification: 4 (n.leader), 0x890415 (n.zxid), 0x6 (n.round), LOOKING (n.state), 4 (n.sid), 0x89 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2013-10-28 18:43:18,134 - INFO [WorkerReceiver[myid=4]] - Notification: 2 (n.leader), 0x88002c (n.zxid), 0x (n.round), FOLLOWING (n.state), 0 (n.sid), 0x89 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2013-10-28 18:43:18,135 - INFO [WorkerReceiver[myid=4]] - Notification: 2 (n.leader), 0x88002c (n.zxid), 0x6 (n.round), LEADING (n.state), 2 (n.sid), 0x88 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2013-10-28 18:43:18,135 - INFO [WorkerReceiver[myid=4]] - Notification: 2 (n.leader), 0x88002c (n.zxid), 0x6 (n.round), FOLLOWING (n.state), 3 (n.sid), 0x88 (n.peerEPoch), LOOKING (my state)0 (n.config version)
2013-10-28 18:43:18,136 - INFO [WorkerReceiver[myid=4]] - Notification: 2 (n.leader), 0x88002c (n.zxid), 0x (n.round), FOLLOWING (n.state), 1 (n.sid), 0x89 (n.peerEPoch), LOOKING (my state)0 (n.config version)
{noformat}

I am guessing IGNOREVALUE (0x) as the round value is causing issues? What was the expected behavior here (i.e., when dealing with cluster members without this patch during an upgrade)?
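One way to read the excerpt is to tally the four out-of-election notifications (sids 0, 1, 2 and 3) by the leader and peer epoch they report. The following is a rough sketch under the assumption that votes are grouped on exactly those two fields; the class and its methods are hypothetical and only mirror the values in the log, not ZooKeeper's actual termPredicate.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not ZooKeeper code.
final class OutOfElectionTally {
    /** One notification from a peer that is already FOLLOWING or LEADING. */
    static final class Notification {
        final long sid;
        final long leader;
        final long peerEpoch;

        Notification(long sid, long leader, long peerEpoch) {
            this.sid = sid;
            this.leader = leader;
            this.peerEpoch = peerEpoch;
        }
    }

    /** Counts how many notifications agree on each (leader, peerEpoch) pair. */
    static Map<String, Integer> tally(Notification[] notifications) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (Notification n : notifications) {
            String key = n.leader + "@0x" + Long.toHexString(n.peerEpoch);
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** True if any group reaches a strict majority of the ensemble. */
    static boolean anyMajority(Map<String, Integer> counts, int ensembleSize) {
        for (int count : counts.values()) {
            if (count > ensembleSize / 2) {
                return true;
            }
        }
        return false;
    }

    /** The four FOLLOWING/LEADING notifications from the log excerpt. */
    static Notification[] fromLog() {
        return new Notification[] {
            new Notification(0, 2, 0x89),
            new Notification(2, 2, 0x88),
            new Notification(3, 2, 0x88),
            new Notification(1, 2, 0x89),
        };
    }
}
```

Under that assumption the tally is two notifications for leader 2 at epoch 0x89 and two at epoch 0x88. Neither group reaches the three needed in a 5-member ensemble, which would be consistent with server 4 staying in LOOKING.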
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807553#comment-13807553 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

Hmm, this is odd. I don't understand why the notifications don't have the same round value, the don't care value in this case. The value is also not what I expected, so I might have done something wrong there. Let me have a closer look and report back. Thanks for reporting, [~rgs].
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807565#comment-13807565 ]

Raul Gutierrez Segales commented on ZOOKEEPER-1732:

What's wrong with the round values? That is, the two new servers have IGNOREVALUE (sounds correct, right?) and the older followers have the current round value (i.e., 0x6). I thought the problem would be here:

{noformat}
 * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
 */
outofelection.put(n.sid, new Vote(n.leader,
        IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
if (termPredicate(outofelection, new Vote(n.leader,
        IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
        && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
{noformat}

IGNOREVALUE doesn't work here, because we are talking to un-patched cluster members. Sorry if I am completely misleading you :) That's as far as I got with my analysis today.
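As a rough illustration of what the masking in the snippet above does and does not neutralise, here is a small stand-in for a vote. The class SketchVote, its equals method, and the value -1 for IGNOREVALUE are assumptions made for this sketch; the real Vote class may compare a different set of fields.

```java
// Illustrative stand-in for a vote; not ZooKeeper's Vote class.
final class SketchVote {
    /** Placeholder for fields that should not affect the comparison (value assumed here). */
    static final long IGNOREVALUE = -1L;

    final long leader;
    final long zxid;
    final long electionEpoch;
    final long peerEpoch;

    SketchVote(long leader, long zxid, long electionEpoch, long peerEpoch) {
        this.leader = leader;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
        this.peerEpoch = peerEpoch;
    }

    /** Copy of this vote with zxid and election epoch replaced by the placeholder. */
    SketchVote masked() {
        return new SketchVote(leader, IGNOREVALUE, IGNOREVALUE, peerEpoch);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchVote)) {
            return false;
        }
        SketchVote v = (SketchVote) o;
        return leader == v.leader
                && zxid == v.zxid
                && electionEpoch == v.electionEpoch
                && peerEpoch == v.peerEpoch;
    }

    @Override
    public int hashCode() {
        // Consistent with equals: equal votes share leader and peerEpoch.
        return (int) (leader * 31 + peerEpoch);
    }
}
```

In this sketch, two votes that differ only in the round compare equal once both are masked, but two votes that differ in the peer epoch still do not. Masking the round on the receiving side therefore cannot hide an epoch mismatch between patched and un-patched members.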
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807588#comment-13807588 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

I see. My mental model of the problem ignored the fact that there were servers with newer and older versions, my bad. I think the IGNOREVALUE is not really being ignored. I'll come up with a fix, but I'll do it in a different JIRA.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807633#comment-13807633 ]

Thawan Kooburat commented on ZOOKEEPER-1732:

Maybe we should start considering an automated rolling upgrade test? In Jenkins we might be able to continuously grab the 3.4 branch, perform a rolling upgrade to 3.5, and verify that the quorum comes up.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807673#comment-13807673 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

Given that rolling upgrades seem to be very common, it doesn't sound like a bad idea to automate the testing. I think we can't do it with junit, or at least I don't know how.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801717#comment-13801717 ]

Hudson commented on ZOOKEEPER-1732:

SUCCESS: Integrated in ZooKeeper-trunk #2097 (See [https://builds.apache.org/job/ZooKeeper-trunk/2097/])
ZOOKEEPER-1732. ZooKeeper server unable to join established ensemble (German Blanco via fpj) (fpj: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1534390)
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/FLETest.java
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800766#comment-13800766 ]

Germán Blanco commented on ZOOKEEPER-1732:

Sorry, I didn't realise you changed the name, and I was commenting on my last version vs. the one before that. On your changes, my only comment is that I understand you intend to apply the same formatting to the trunk patch, right? I can do that if you want.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800769#comment-13800769 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

I'm sorry too, I haven't been able to get to it. If you have some time and can generate the trunk patch, I'd appreciate it.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800881#comment-13800881 ]

Hadoop QA commented on ZOOKEEPER-1732:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12609472/ZOOKEEPER-1732.patch
against trunk revision 1533161.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1713//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1713//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1713//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800150#comment-13800150 ]

Germán Blanco commented on ZOOKEEPER-1732:

+1 to the 3.4 patch. Thank you [~fpj], for the review and for making those changes, I shouldn't have left it like that. That means one blocker less to go for 3.4.6! Thanks a lot also for your work on the release.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788947#comment-13788947 ]

Germán Blanco commented on ZOOKEEPER-1732:

Is there anything else to be done in this one?
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789374#comment-13789374 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

One thing I'm not happy about in your patch is that you use zero for the don't-care values. For readability, I'd rather have different method calls, or perhaps constants, reflecting the fact that we are not taking those values into account. Adding comments to the code explaining what's going on sounds like a good thing to do.
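A minimal sketch of the "different method calls" option follows: instead of passing a literal 0 for fields that should not count, the comparison gets its own method that simply never looks at them. The class and method names are hypothetical and are not taken from any of the attached patches.

```java
// Illustrative sketch only; names are hypothetical, not from the patch.
final class EnsembleVoteSketch {
    final long leader;
    final long zxid;
    final long electionEpoch;
    final long peerEpoch;

    EnsembleVoteSketch(long leader, long zxid, long electionEpoch, long peerEpoch) {
        this.leader = leader;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
        this.peerEpoch = peerEpoch;
    }

    /**
     * Compares only the leader and the peer epoch. The zxid and the
     * election epoch are deliberately not consulted, so no sentinel
     * value is needed to express "don't care".
     */
    boolean sameLeaderAndPeerEpoch(EnsembleVoteSketch other) {
        return leader == other.leader && peerEpoch == other.peerEpoch;
    }
}
```

The point of the design is readability: with a literal 0, a reader cannot tell a genuine zxid or round of zero from a placeholder, whereas a dedicated method (or a named constant) states the intent at the call site.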
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784955#comment-13784955 ]

Hadoop QA commented on ZOOKEEPER-1732:

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12606543/ZOOKEEPER-1732.patch
against trunk revision 1528586.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1634//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1634//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1634//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785329#comment-13785329 ]

Marshall McMullen commented on ZOOKEEPER-1732:

We've just run into this issue running tip of trunk 3.5.0 *without* this patch applied. Are there any proposed workarounds to this problem? I tried removing the stuck node from the ensemble and adding another node in as a replacement but it is now hitting the same problem... It can't join the ensemble either.

I'm considering restarting all zookeeper servers in the hopes that a new round of leader election will reset things. Does this sound safe? Are there any other alternatives?

Really appreciate any help.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785345#comment-13785345 ]

Marshall McMullen commented on ZOOKEEPER-1732:

Flavio, that suggestion worked perfectly! Simply restarting the leader caused a new round of leader election and things sorted themselves out within a few seconds. Thank you so much for such a prompt reply. Love this community!
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784565#comment-13784565 ] Flavio Junqueira commented on ZOOKEEPER-1732:
I have downgraded this issue to major: it is a corner case and unlikely to happen often, but we still need to fix it. I'm thinking that we should update the peer epoch at the end of syncWithLeader rather than in registerWithLeader, where it is updated now. After syncing, we know the current epoch, so we should just update it there. I was also thinking that we could update the zxid as well, although it doesn't matter too much. The indentation still looks wrong to me.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784807#comment-13784807 ] Germán Blanco commented on ZOOKEEPER-1732:
I will update the patches as you propose. It also seems better to me to update this information when the ensemble is finally established. By "the current epoch" I assume you mean the new epoch for the ensemble, and that it will have to be updated in both the followers and the leader (so newEpoch, not newEpoch-1). If I misunderstood, please let me know.

I don't know how to choose a zxid that is agreed between the leader and all the followers. I will leave that for now, but if you know how to do it, please let me know.

I am assuming that the indentation comment refers to the modification in Learner.java. The indentation was already wrong in that part of the code. If I am not mistaken, there are 8 spaces and a tab in the lines above and below the modification, and I believe I have also put 8 spaces and a tab in the modified lines. In the editors that I use it looks ok. If you want me to fix the indentation around the change, that is fine with me. If you mean another change, or you don't see 8 spaces and a tab there, or I should use a different combination for these lines, please let me know.
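The change discussed in the two comments above (refreshing the election vote's epoch once syncing has finished, rather than during registration) can be illustrated with a toy model. This is only a sketch under stated assumptions: `Vote`, `ToyLearner`, and their fields are hypothetical stand-ins that borrow the method names from the discussion; they are not the actual ZooKeeper `Learner` code or the contents of any attached patch.

```java
// Toy model only: these classes are hypothetical stand-ins, not ZooKeeper source.
final class EpochUpdateSketch {

    /** Simplified election vote: the elected leader and the epoch the vote carries. */
    static final class Vote {
        final long leaderId;
        final long peerEpoch;

        Vote(long leaderId, long peerEpoch) {
            this.leaderId = leaderId;
            this.peerEpoch = peerEpoch;
        }
    }

    /** Simplified learner that separates the handshake from the end of the sync. */
    static final class ToyLearner {
        private Vote currentVote;
        private long acceptedEpoch;

        ToyLearner(Vote electionResult) {
            this.currentVote = electionResult;
            this.acceptedEpoch = electionResult.peerEpoch;
        }

        /** Handshake step: record the leader's proposed epoch but leave the vote alone. */
        long registerWithLeader(long proposedEpoch) {
            acceptedEpoch = proposedEpoch;
            return acceptedEpoch;
        }

        /** Sync step: once syncing is done the epoch is known, so refresh the vote here. */
        void syncWithLeader(long newEpoch) {
            // (the snapshot or diff transfer would happen before this point)
            currentVote = new Vote(currentVote.leaderId, newEpoch);
        }

        Vote getCurrentVote() {
            return currentVote;
        }
    }
}
```

In this model a learner that won the election with epoch 4 still reports epoch 4 after `registerWithLeader(5)` and only reports epoch 5 after `syncWithLeader(5)`, which is the ordering proposed above.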
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784822#comment-13784822 ] Germán Blanco commented on ZOOKEEPER-1732:
Oooops! I just noticed that the change in Learner.java is the one that needs to be moved to another method, so if the last indentation comment was about that change, then please disregard my response. Hopefully I will get the indentation right in the new method :-)
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13781639#comment-13781639 ] Germán Blanco commented on ZOOKEEPER-1732:
Thanks a lot for taking a look at this, and for your comments.
1 - I guess you mean the if (n.leader != self.getId()) check in FastLeaderElection.java. I will remove that.
2 - This is because the final election vote in the leader will have newEpoch - 1. The winning vote contains the epoch of the leader, but once the election finishes, the leader increments the epoch. So newEpoch is actually the epoch that won the election plus one, and the vote reported in the Fast Leader Election is the vote that won the election. If we set the vote to just newEpoch, then we need to update the vote in the Leader as well, and not only in the Learners as is currently done. At least that is how I think it works.
3 - I will fix the indentation and add the comment. The indentation is funny around those lines though.
4 - I will change the name and change it to protected.
5 - Also will do.
Patches will be uploaded in a few minutes.
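The epoch arithmetic in point 2 above can be written out explicitly. This is a minimal sketch: the class and method names are hypothetical, and only the relationship between the winning vote's epoch and newEpoch is taken from the comment.

```java
// Hypothetical helper names; only the arithmetic comes from the comment above.
final class ElectionEpochArithmetic {

    /** The leader increments the epoch once the election is over. */
    static long newEpochAfterElection(long winningVoteEpoch) {
        return winningVoteEpoch + 1;
    }

    /** Epoch carried by the vote that won the election, recovered from newEpoch. */
    static long winningVoteEpoch(long newEpoch) {
        return newEpoch - 1;
    }
}
```

Under this model, a learner that stores `winningVoteEpoch(newEpoch)` in its vote ends up with the same value as a leader that never touched its winning vote, which is the reason given above for using newEpoch - 1.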
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13781747#comment-13781747 ] Hadoop QA commented on ZOOKEEPER-1732:
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12605873/ZOOKEEPER-1732.patch
against trunk revision 1527398.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1606//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1606//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1606//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13781430#comment-13781430 ] Hadoop QA commented on ZOOKEEPER-1732:
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12604546/ZOOKEEPER-1732-3.4.patch
against trunk revision 1527129.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1604//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754728#comment-13754728 ] János Grásl commented on ZOOKEEPER-1732:
Sorry for the last comment; my out-of-office assistant replied to Germán's comment.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13752196#comment-13752196 ] Germán Blanco commented on ZOOKEEPER-1732:
Could anybody please review this?
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13752197#comment-13752197 ] János Grásl commented on ZOOKEEPER-1732:
I am out of the office from 2013-08-26 to 2013-08-31. I will check my email on 2013-09-01. If you need immediate assistance or information, please contact Csaba Nagy (ecsanag). This is an auto-reply message sent by my Out of Office Assistant.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13730980#comment-13730980 ] Hadoop QA commented on ZOOKEEPER-1732:
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12596368/ZOOKEEPER-1732.patch
against trunk revision 1503101.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1529//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1529//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1529//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13724085#comment-13724085 ] Germán Blanco commented on ZOOKEEPER-1732:
I now think that we also need to do something with the peerEpoch. I can't explain why this hasn't failed so far in my tests; maybe the corner case causing this problem is even more unlikely than I thought. But the peer epoch value does get sent from the leader to the follower after the election, right? So it would be possible to just update the value in the leader election information of the follower during the synchronization phase of the Zab protocol, instead of loosening the restriction. That way, there will be at least one check verifying that all the votes come from an ensemble established with the same epoch. What do you think? I will also run the tests again with a trace to see when the inconsistent ensemble is created.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720595#comment-13720595 ] Flavio Junqueira commented on ZOOKEEPER-1732:
bq. joining an ensemble that votes me as the leader.
I'm ok with removing it; this is an optimization. If the leader is being re-elected, then the ensemble it is trying to join is not functional, since the leader is not present. To do it, you might as well check the change of ZOOKEEPER-1514 in checkLeader. I think the if block you added is not necessary if you make the change in checkLeader.

bq. taking into account my own votes or votes that put me as a leader when joining an ensemble.
I don't think we are currently taking into account the vote of a LOOKING server when processing FOLLOWING/LEADING notifications. If you're talking about endVote, this is the vote corresponding to the leader it elected.

bq. removing the check for the election round when joining an established ensemble.
Let me give some insight here first. We need to have servers joining an established ensemble because a server may find that a quorum is already following some leader, and if it follows the standard procedure of processing notifications, there are some corner cases that can cause it to keep electing some other server that is also looking.

The danger of joining an established ensemble is the following. Say that a minority of followers supports a leader L, and a majority M supports L'. L' has enough supporters and is able to commit txns. Now say that a server S in the ensemble of L' crashes and recovers. S talks to L and its minority, which now forms a majority (say there was only one server missing to form a majority). L will tell all servers in its ensemble to truncate, causing some txns to be lost.

We have a couple of mechanisms that prevent this incorrect truncation from happening. First, S needs to receive FOLLOWING/LEADING notifications from a quorum, not including itself. In this case, the incorrect truncation only happens if S receives a stale message, that is, a message from a server S' that later on followed L'. We prevent this case by allowing at most one outstanding notification per peer in the QuorumCnxManager queue. If S' has followed L', then its notification must reflect it and S won't receive such a stale message.

Overall it sounds fine to only consider the server the followers are following. Note that not only the round could be different; I believe the zxid could also be different.
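The join condition described above (FOLLOWING/LEADING notifications from a quorum that does not include the joining server, with stale reports superseded by newer ones) can be sketched as follows. This is a simplified, hypothetical model: `Notification`, `canJoin`, and the latest-per-sender map are assumptions made for illustration and only loosely mirror the single-outstanding-notification behaviour attributed to QuorumCnxManager; it is not the FastLeaderElection implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the join check; not the FastLeaderElection code.
final class JoinEnsembleSketch {

    enum State { LOOKING, FOLLOWING, LEADING }

    /** What a peer reports: who it is, its state, and which leader it supports. */
    static final class Notification {
        final long senderId;
        final State state;
        final long leaderId;

        Notification(long senderId, State state, long leaderId) {
            this.senderId = senderId;
            this.state = state;
            this.leaderId = leaderId;
        }
    }

    /** Decides whether selfId may join the ensemble led by candidateLeader. */
    static boolean canJoin(long selfId, long candidateLeader, int ensembleSize,
                           List<Notification> receivedInOrder) {
        // Keep only the most recent notification per sender, so a stale report
        // from a server that has since moved to another leader is overwritten.
        Map<Long, Notification> latest = new HashMap<>();
        for (Notification n : receivedInOrder) {
            latest.put(n.senderId, n);
        }

        int supporters = 0;
        boolean leaderIsLeading = false;
        for (Notification n : latest.values()) {
            if (n.senderId == selfId || n.state == State.LOOKING) {
                continue; // our own vote and LOOKING peers do not count
            }
            if (n.leaderId != candidateLeader) {
                continue; // this peer supports a different leader
            }
            supporters++;
            if (n.senderId == candidateLeader && n.state == State.LEADING) {
                leaderIsLeading = true;
            }
        }
        // A quorum is a strict majority of the whole ensemble.
        return leaderIsLeading && supporters > ensembleSize / 2;
    }
}
```

With three servers, a LEADING report from server 1 plus a FOLLOWING report from server 2 lets server 3 join; if server 2 later reports that it follows a different leader, the newer report replaces the old one and the check fails.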
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720835#comment-13720835 ] Hadoop QA commented on ZOOKEEPER-1732:
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12594398/ZOOKEEPER-1732.patch
against trunk revision 1503101.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1522//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1522//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1522//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719402#comment-13719402 ] Flavio Junqueira commented on ZOOKEEPER-1732:
To agree on a vote, don't you need a different message pattern, even if the message content is the same? You're still changing the protocol here. Also, we don't need agreement, since different processes can have a different opinion about who the leader should be. They need to agree before they start a new epoch, but that's precisely what the recovery phase of Zab does. It does a bit more actually, but the whole state sync up is not relevant to this discussion.

bq. it actually doesn't take part in the leader election logic
This is not entirely true: the LE step exposes a leader that has the highest zxid among a quorum of servers. Also, I think that you're using LE to mean the recovery phase of Zab, not the initial protocol that finds a prospective leader.

bq. The new server just checks if the ensemble has a quorum and the leader is alive (sends a notification voting for itself)
I believe we have discussed this point in this jira. As you have observed, the ensemble is still able to make progress in the situation you originally described, so the inconsistent LE information doesn't prevent ZooKeeper from doing work. The problem is getting a server stuck, which we fix by making sure that a follower is able to send notifications with state that reflects the latest leader election.

One option I was actually considering is to loosen the constraint that all FOLLOWING/LEADING notifications need to come from the same LE round. The constraint is possibly too conservative, so it might be ok to change it, but I need to think a bit more about it.
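The option of loosening the same-round constraint can be pictured as two ways of comparing votes. This is a hypothetical sketch: the `Vote` fields and both predicates are illustrative names chosen to echo the discussion, and are not taken from the real `Vote` class or from any attached patch.

```java
// Hypothetical sketch: field names echo the discussion, not the real Vote class.
final class VoteComparisonSketch {

    static final class Vote {
        final long leaderId;
        final long zxid;
        final long electionRound;
        final long peerEpoch;

        Vote(long leaderId, long zxid, long electionRound, long peerEpoch) {
            this.leaderId = leaderId;
            this.zxid = zxid;
            this.electionRound = electionRound;
            this.peerEpoch = peerEpoch;
        }
    }

    /** Strict comparison: every field, including the LE round, must match. */
    static boolean sameVoteStrict(Vote a, Vote b) {
        return a.leaderId == b.leaderId
                && a.zxid == b.zxid
                && a.electionRound == b.electionRound
                && a.peerEpoch == b.peerEpoch;
    }

    /** Loosened comparison: the LE round is ignored, everything else must match. */
    static boolean sameVoteIgnoringRound(Vote a, Vote b) {
        return a.leaderId == b.leaderId
                && a.zxid == b.zxid
                && a.peerEpoch == b.peerEpoch;
    }
}
```

Two votes for the same leader that differ only in their round are rejected by the strict predicate and accepted by the loosened one, which is the situation that leaves a joining server stuck under the strict rule.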
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719483#comment-13719483 ] Germán Blanco commented on ZOOKEEPER-1732:
The proposal was to set the election epoch to the same value during the initial phase of the Zab protocol. That value would be the proposed epoch in the LEADERINFO structure. As you say, it is a change in the protocol, even if the messages carry the same information. And it wouldn't work if there are servers in the quorum with different behaviours (e.g. 3.4.5 and 3.4.6 with this change implemented), since they will end up reporting different election epochs. I hope loosening the constraint works, that would really be an easy solution :-)
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13716152#comment-13716152 ] Germán Blanco commented on ZOOKEEPER-1732:
That sounds good to me. I still see option 3 as a more direct way of solving the problem, but it does involve quite a mess with updating the protocol, the test cases and so on for such a corner case. It seems that there are no more opinions on this. Is it ok if I prepare a patch with the change in the leader election process that you suggest? The test case might be a bit tricky, do you have any suggestion for that?
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13716210#comment-13716210 ] Flavio Junqueira commented on ZOOKEEPER-1732:
I think you say that 3 is a more direct way of solving the problem because we would be enforcing that a follower is following the right instance of the leader. It is a fair observation, although I tend to think that leader election is unreliable in nature, so I can really go either way. Given that you are keen on implementing the changes to the recovery handshake, what if we try to outline the precise changes in both cases and determine which one to go with once we have that? Here are some initial thoughts.

For option 3, we add LE information to the LearnerInfo message. The leader checks the version of the protocol and uses the new information in LearnerInfo if the protocol version is appropriate. If the leader instance information doesn't match, we have two choices:
# The leader drops the connection to the follower and the follower goes back to LOOKING;
# The leader sends the LE instance information to the follower and the follower updates its vote info.
For the second option, I believe we would need a new message, since LEADERINFO only contains a zxid. I'd rather avoid adding another message, though.

If we don't change the recovery handshake and use the other approach I outlined, then I believe all changes are concentrated in the FLE class, perhaps with some in QuorumPeer as well, I'm not sure. We just need to call sendNotifications() upon receiving a notification while leading. A follower, when receiving a notification from a LEADING server, checks if its vote is still valid and updates it otherwise.

bq. Is it ok if I prepare a patch with the change in the leader election process that you suggest?
Sure, it would be great if you could propose a patch, independent of the approach we end up choosing.

bq. The test case might be a bit tricky, do you have any suggestion for that?
There are some FLE test cases that implement a mock server. I think we should do something similar here. Instead of trying to reproduce the race, we could just test that the follower correctly updates its information upon receiving a notification.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13716227#comment-13716227 ]

Germán Blanco commented on ZOOKEEPER-1732:

bq. For option 3, we add LE information to the LearnerInfo message ...

How would you add the information to the LearnerInfo message? If the receiving side doesn't support the same version of the protocol it will not be able to parse the message, right? Do you have in mind to use some currently unused field?

My suggestion was that in the message with LearnerInfo, the Learner only reports that it supports the additional information, i.e. sending 0x10001 as the protocol version. The Leader sees this and then it will include the leader election information together with the LeaderInfo, only if the Learner supports this additional information. The Learner receives the message and it will read the leader election info only if the protocol supported by the Leader is also 0x10001. At this point the Learner can just update its leader election information with the one it got from the Leader. No new message that way :-)

bq. If we don't change the recovery handshake and use the other approach I outlined, then I believe all changes are concentrated in the FLE class, ...

I also see it that way, and I would do it exactly as you say. It is around 20 lines of code near the end of FastLeaderElection$Messenger$WorkerReceiver, something like this: [https://gist.github.com/germanblanco/6060741]. I don't see any changes in QuorumPeer, but maybe I am missing something.

bq. There are some FLE test cases that implement a mock server. I think we should do something similar here. Instead of trying to reproduce the race, we could just test that the follower correctly updates its information upon receiving a notification.

Sounds very good.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13716234#comment-13716234 ]

Germán Blanco commented on ZOOKEEPER-1732:

The changes in option 3 could be something like this: [https://gist.github.com/germanblanco/6060844]. The regression test doesn't work for these changes, so they still need work, but maybe it helps to explain my intention.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714965#comment-13714965 ]

Germán Blanco commented on ZOOKEEPER-1732:

I tried to refresh the proposal simply by doing updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch()); before sending a notification when the server that sends it is part of an established ensemble. The test didn't run for long enough, for other reasons, but I now think that it can't work anyway. Reading your alternatives and the way Votes are compared, I see that zxid and epoch need to be the same in all members of the ensemble, and in this race case the follower hasn't received the zxid that the leader used to finish the election.

My personal preference would be 3, because it is faster (the follower doesn't go back to LOOKING, it can just update the proposal with the info in LeaderInfo), and it doesn't depend on any more races that could lead to the follower not processing the notification from the leader. If the protocol backward compatibility issues are just more work, then I will be very willing to help as much as I can.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714986#comment-13714986 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

bq. My personal preference would be 3. Because it is faster (follower doesn't go back to LOOKING, it can just update the proposal with the info in LeaderInfo)

I guess my description was not clear for 3. The idea was that the learner sends its LE info so that the leader can drop it if the learner info is stale, so the follower goes back to LOOKING in this option as well. I'm not worried about making this case fast because this is such a corner case: a follower f elects a leader l, l crashes, l comes back, l is re-elected by a quorum that does not include f, f is able to connect to l.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715026#comment-13715026 ]

Germán Blanco commented on ZOOKEEPER-1732:

I see, trying to make this a bit faster doesn't make sense at all. Sorry, but I am confused about the epoch handling in the initial negotiation between the Follower and the Leader. The current FOLLOWERINFO QuorumPacket seems to already contain the acceptedEpoch. Isn't it already possible for the Leader to check that value and reject the connection if it is wrong?
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715042#comment-13715042 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

Right, this is a bit confusing. Epoch in the context of leader election is not the same as the epoch of zxids, that's one reason why I often call it LE round, to disambiguate. The epoch you're referring to has to do with txn identifiers (zxids), not LE epochs (or rounds).
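To make the distinction above concrete, here is a minimal sketch, assuming the usual ZooKeeper layout in which the zxid epoch is the high 32 bits of the 64-bit transaction id and the counter is the low 32 bits. The class name `ZxidParts` and the sample values are made up for illustration; the LE round is shown as a separate, unrelated number that leader election tracks on its own.

```java
/** Illustrative helper, not ZooKeeper code: splits a zxid into epoch and counter. */
final class ZxidParts {

    /** Epoch part of a zxid: assumed to be the high 32 bits. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Counter part of a zxid: assumed to be the low 32 bits. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = 0xb800000099L; // made-up transaction id
        long leRound = 0xb9L;      // made-up LE round, not derived from the zxid

        System.out.println("zxid epoch   = 0x" + Long.toHexString(epochOf(zxid)));
        System.out.println("zxid counter = 0x" + Long.toHexString(counterOf(zxid)));
        System.out.println("LE round     = 0x" + Long.toHexString(leRound));
    }
}
```

The point of the sketch is only that the two numbers come from different places: one is carved out of a transaction id, the other is a counter owned by the election logic.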
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715056#comment-13715056 ]

Germán Blanco commented on ZOOKEEPER-1732:

Answering myself, I guess the acceptedEpoch in FOLLOWERINFO is not the epoch that was seen last in the election, but the epoch of the last transaction recorded by the Follower. So it doesn't help in this case.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715057#comment-13715057 ]

Germán Blanco commented on ZOOKEEPER-1732:

Sorry, I forgot to refresh the page :-)
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715060#comment-13715060 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

;-)
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715069#comment-13715069 ]

Germán Blanco commented on ZOOKEEPER-1732:

I am thinking that, given that we don't care if this is solved a few milliseconds later, one way to handle the backward compatibility issues would be this:
- Follower/Observer sends an increased protocol number in FOLLOWERINFO/OBSERVERINFO.
- When the Leader sees the new version, it sends the LE epoch and zxid in LeaderInfo.
- Follower/Observer check if the Leader has an updated protocol version and, if it does, they read the LE epoch and zxid from LeaderInfo. If they are not the same as the LE epoch and zxid that they have, they start LOOKING again.

Does that sound ok?
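The three steps proposed above can be sketched roughly as follows. This is a hypothetical model, not the ZooKeeper handshake code: the class and method names are invented, `0x10000` is assumed to be the existing protocol version, and `0x10001` is the increased value floated elsewhere in this thread.

```java
/** Hypothetical sketch of a version-gated handshake; names and constants are assumptions. */
final class HandshakeSketch {

    static final int BASE_VERSION = 0x10000;    // assumed current protocol version
    static final int LE_INFO_VERSION = 0x10001; // assumed "increased" version

    /** Leader side: attach LE round and zxid only for learners that advertise the new version. */
    static boolean leaderIncludesLeInfo(int learnerVersion) {
        return learnerVersion >= LE_INFO_VERSION;
    }

    /**
     * Learner side: returns true if the learner should go back to LOOKING.
     * An old leader sends no LE info, so there is nothing to compare and the
     * learner behaves as it does today.
     */
    static boolean learnerMustRestartElection(int leaderVersion,
                                              long leaderRound, long leaderZxid,
                                              long myRound, long myZxid) {
        if (leaderVersion < LE_INFO_VERSION) {
            return false;
        }
        return leaderRound != myRound || leaderZxid != myZxid;
    }

    public static void main(String[] args) {
        // New learner talking to a new leader whose election info has moved on.
        boolean restart = learnerMustRestartElection(
                LE_INFO_VERSION, 0xbaL, 0xb90052L, 0xb9L, 0xb80099L);
        System.out.println("restart election: " + restart);
    }
}
```

With this shape, mixed-version ensembles fall back to the existing behaviour, because either side skips the extra check when its peer advertises the older version.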
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715077#comment-13715077 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

I'm not sure why you think it is better to touch the Zab protocol rather than simply stop following if the follower vote is outdated. It is the simplest thing we can do and it is correct. I don't think touching the Zab protocol is a good idea, so I still prefer option 1 above.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715089#comment-13715089 ]

Germán Blanco commented on ZOOKEEPER-1732:

I would prefer not to touch the Zab protocol as well, but I don't want to leave any loose ends. I thought that there was no guarantee that the Follower will receive a notification from the Leader after it connects. Do you think that this will always happen? This notification is sent through the leader election port, so I guess that it is difficult to make assumptions about when it reaches its destination with respect to when the Follower connects to the Leader. And going deeper into the unlikely combinations, the leader election connection could be lost at some point while Follower and Leader remain connected through the Zab connection.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715115#comment-13715115 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

I understand your concern, it is a valid one. Our goal however shouldn't be to have every follower having the most recent notification from the leader. Our goal is to have a leader that has enough supporters so that we can make progress. If all servers are either following/leading, then it doesn't matter if a follower has stale LE information. But, it does become an issue in the case that you uncovered with your logs. In this scenario, the stale follower will receive a more recent notification from either the leader, when it is trying to be re-elected, or from the follower that is stuck. In either case, it will be able to determine that it is stale and stop following. The bottom line is that the follower stops following only if it realizes that it is stale. If it doesn't hear anything, then it just keeps going. Does it work?

About changing the protocol, in my experience, changing messages is a pain because there are many subtle cases and it is quite easy to get it wrong. It is best not to touch it. I think we can take care of this case without really changing the messages we are sending.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715120#comment-13715120 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

... I should have said from the server that is stuck, not from the follower that is stuck.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715130#comment-13715130 ]

Germán Blanco commented on ZOOKEEPER-1732:

If I am not wrong, the follower that is stuck never accepts the leadership of the Leader in the ensemble, because it is an established ensemble and it sees no quorum in it. So it will only send notifications proposing itself as the leader. And the leader of the ensemble sends the Notifications only to the follower that is stuck, right? So there is actually no chance for the stale follower to receive the updated leader election information after the initial election is finished.

I agree that one must be careful when changing network protocols and dealing with backwards compatibility, but if you ask me, I think that it is much easier to make mistakes doing multithreaded concurrent Java programming :-P
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715474#comment-13715474 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

bq. there is actually no chance for the stale follower to receive the updated leader election information after the initial election is finished.

What if the leader, upon receiving a notification, sends a batch of notifications instead of responding only to the sender? This way we can perform the check I was mentioning before and there is no real change to the protocol, we are just sending more messages every now and then.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715514#comment-13715514 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

I was actually thinking that if we do what I just proposed above, then the follower should just update its vote. If a server is following another it believes to be the leader, then it doesn't matter how it got there, it matters that one believes it is following and the other believes it is leading. My proposal more concretely is to have the leader broadcast its notification upon receiving a notification from a server that is LOOKING. If a follower receives a notification from the guy it is following and it notices that its vote is outdated, then it updates its vote accordingly.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714783#comment-13714783 ]

Flavio Junqueira commented on ZOOKEEPER-1732:

bq. If that is the case, then it would be solved by refreshing the proposal after the Follower connects to the Leader, right?

How would you refresh the proposal? In my view, there are three ways to solve this issue:
# If a follower receives a notification from the guy it thinks it is following and the notification is more recent (later round or higher zxid), then it stops following and goes back to LOOKING.
# Same as before but instead of transitioning to LOOKING, the follower updates its notification and keeps trying to connect to the leader.
# We change LearnerInfo so that it carries leader election information for verification purposes.

I think we should do the first one. The second one bypasses the LE protocol, so I'm not in favor of going that direction, although it might not break anything if we do it. The third option changes the protocol, so it is a bit of a pain to deal with the backward compatibility stuff.
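As a rough illustration of the first option, the staleness check could look like the sketch below. It is only a model under stated assumptions: the class `StaleVoteCheck`, its enum, and its method signatures are hypothetical and do not correspond to FastLeaderElection code, and "more recent" is taken literally from the description above as "later round or higher zxid".

```java
/** Hypothetical sketch of option 1: a follower reacting to a notification from its leader. */
final class StaleVoteCheck {

    enum Action { KEEP_FOLLOWING, GO_LOOKING }

    /** "More recent" as described in the thread: later round or higher zxid. */
    static boolean isMoreRecent(long nRound, long nZxid, long myRound, long myZxid) {
        return nRound > myRound || nZxid > myZxid;
    }

    /**
     * Only notifications sent by the server this follower believes it is
     * following can reveal that the follower's own vote is stale.
     */
    static Action onNotification(long senderId, long myLeaderId,
                                 long nRound, long nZxid,
                                 long myRound, long myZxid) {
        if (senderId != myLeaderId) {
            return Action.KEEP_FOLLOWING;
        }
        return isMoreRecent(nRound, nZxid, myRound, myZxid)
                ? Action.GO_LOOKING
                : Action.KEEP_FOLLOWING;
    }

    public static void main(String[] args) {
        // Follower still holds the vote from an earlier leader instance of server 3.
        Action a = onNotification(3L, 3L, 0xbaL, 0xb90052L, 0xb9L, 0xb80099L);
        System.out.println(a);
    }
}
```

The second option would differ only in the outcome: instead of returning `GO_LOOKING`, the follower would overwrite its stored round and zxid with the values from the notification.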
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13714506#comment-13714506 ]

Germán Blanco commented on ZOOKEEPER-1732:

Thanks a lot for your analysis, Flavio! If that is the case, then it would be solved by refreshing the proposal after the Follower connects to the Leader, right? I will try to make a modification with a refreshment of the proposal and run the test again to check what happens.
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713801#comment-13713801 ] Germán Blanco commented on ZOOKEEPER-1732:
It seems that the two servers in the ensemble are sending Notifications with different peerEpoch values to the server that is outside the ensemble:
2013-07-19 10:17:00,833 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 3 (n.leader), 0xb80099 (n.zxid), 0xb9 (n.round), FOLLOWING (n.state), 2 (n.sid), 0xb8 (n.peerEPoch), LOOKING (my state)
2013-07-19 10:17:00,833 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@542] - Notification: 3 (n.leader), 0xb90052 (n.zxid), 0xba (n.round), LEADING (n.state), 3 (n.sid), 0xb9 (n.peerEPoch), LOOKING (my state)
Is that correct?
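To make the mismatch in the two notifications concrete, here is a small sketch that treats each notification as a vote and counts how many received votes match a candidate exactly. It assumes that votes are compared on leader, zxid, round, and peerEpoch together and that a strict majority is needed; both are simplifying assumptions for illustration, and the class and method names are hypothetical rather than taken from the ZooKeeper code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the comparison rule is an assumption, not ZooKeeper's code.
final class NotificationCompare {

    /** The fields printed in the Notification log lines. */
    static final class Vote {
        final long leader;
        final long zxid;
        final long round;
        final long peerEpoch;

        Vote(long leader, long zxid, long round, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.round = round;
            this.peerEpoch = peerEpoch;
        }

        /** Assumed rule: two votes agree only if every field is equal. */
        boolean sameAs(Vote other) {
            return leader == other.leader
                    && zxid == other.zxid
                    && round == other.round
                    && peerEpoch == other.peerEpoch;
        }
    }

    /** Number of received votes that match the candidate on every field. */
    static int countMatching(List<Vote> received, Vote candidate) {
        int matches = 0;
        for (Vote v : received) {
            if (v.sameAs(candidate)) {
                matches++;
            }
        }
        return matches;
    }

    /** Strict majority of the ensemble must back the candidate. */
    static boolean hasQuorum(List<Vote> received, Vote candidate, int ensembleSize) {
        return countMatching(received, candidate) > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // Values copied from the two log lines: sent by server 2 and server 3.
        Vote fromServer2 = new Vote(3L, 0xb80099L, 0xb9L, 0xb8L);
        Vote fromServer3 = new Vote(3L, 0xb90052L, 0xbaL, 0xb9L);
        List<Vote> received = Arrays.asList(fromServer2, fromServer3);
        System.out.println(hasQuorum(received, fromServer3, 3));
    }
}
```

Under these assumptions both notifications name server 3 as leader, yet neither vote is backed by a majority of the three servers, which would be consistent with server 1 staying in LOOKING.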
[jira] [Commented] (ZOOKEEPER-1732) ZooKeeper server unable to join established ensemble
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713933#comment-13713933 ] Flavio Junqueira commented on ZOOKEEPER-1732:
Here is my analysis of the logs. Server 3 has been elected twice, both times with the support of Server 1:
{noformat}
2013-07-19 10:16:09,746 [myid:3] - DEBUG [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:30103:FastLeaderElection@493] - About to leave FLE instance: leader=3, zxid=0xb80099, my id=3, my state=LEADING
2013-07-19 10:16:26,667 [myid:3] - DEBUG [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:30103:FastLeaderElection@493] - About to leave FLE instance: leader=3, zxid=0xb90052, my id=3, my state=LEADING
{noformat}
Server 2 elects Server 3 but loses the connection to Server 3 right after:
{noformat}
2013-07-19 10:16:20,858 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:30102:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 47
2013-07-19 10:16:20,858 [myid:2] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@762] - Connection broken for id 3, my id = 2, error =
{noformat}
And it doesn't seem to go into a new round of leader election. Because it is not trying to elect a new leader, its vote reflects the state of the first leader instance of Server 3. Now, Server 3 later on loses its connection to Server 1:
{noformat}
2013-07-19 10:16:34,307 [myid:3] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@762] - Connection broken for id 1, my id = 3, error =
{noformat}
but it doesn't seem to care, so it must have the support of Server 2. Server 2 again seems to be referring to a previous leader instance of Server 3, so its support for Server 3 must be surviving the crash of Server 3 around 2013-07-19 10:16:20,858 [myid:2]. My guess is that Server 2 gets confused when the connection drops right after it elects Server 3, and it keeps trying to establish a new connection, which succeeds when Server 3 comes back up.
I think there is a race there.