[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027316#comment-13027316 ]

Hudson commented on ZOOKEEPER-975:
----------------------------------

Integrated in ZooKeeper-trunk #1168 (See [https://builds.apache.org/hudson/job/ZooKeeper-trunk/1168/])
ZOOKEEPER-975. new peer goes in LEADING state even if ensemble is online. (vishal via fpj)

> new peer goes in LEADING state even if ensemble is online
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-975
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-975
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Assignee: Vishal K
>             Fix For: 3.4.0
>         Attachments: ZOOKEEPER-975.patch, ZOOKEEPER-975.patch, ZOOKEEPER-975.patch2, ZOOKEEPER-975.patch3, ZOOKEEPER-975.patch4, ZOOKEEPER-975.patch5, ZOOKEEPER-975.patch6
>
> Scenario:
> 1. Two of the three ZK nodes are online.
> 2. The third node is attempting to join.
> 3. The third node unnecessarily goes into the "LEADING" state.
> 4. The third node then goes back to LOOKING (no majority of followers) and finally goes into the FOLLOWING state.
>
> While going through the logs I noticed that a peer C trying to join an already formed cluster goes into the LEADING state. This happens because the QuorumCnxManager of A and B sends the entire history of notification messages to C, so C receives the notifications that A and B exchanged while they were forming the cluster.
>
> In FastLeaderElection.lookForLeader(), the following piece of code (around line 740) makes C quit lookForLeader() assuming that it is supposed to lead:
>
>     // If we have received from all nodes, then terminate
>     if ((self.getVotingView().size() == recvset.size()) &&
>             (self.getQuorumVerifier().getWeight(proposedLeader) != 0)) {
>         self.setPeerState((proposedLeader == self.getId()) ?
>                 ServerState.LEADING : learningState());
>         leaveInstance();
>         return new Vote(proposedLeader, proposedZxid);
>     } else if (termPredicate(recvset, ...
>
> This can cause:
> 1. C to unnecessarily go into the LEADING state, wait for tickTime * initLimit, and then restart FLE.
> 2. C to wait for 200 ms (finalizeWait) and then decide based on whatever notifications it has received. C could potentially decide to follow an old leader, fail to connect to that leader, and then restart FLE. See the code below (around line 752):
>
>     if (termPredicate(recvset,
>             new Vote(proposedLeader, proposedZxid,
>                     logicalclock))) {
>
>         // Verify if there is any change in the proposed leader
>         while ((n = recvqueue.poll(finalizeWait,
>                 TimeUnit.MILLISECONDS)) != null) {
>             if (totalOrderPredicate(n.leader, n.zxid,
>                     proposedLeader, proposedZxid)) {
>                 recvqueue.put(n);
>                 break;
>             }
>         }
>
> In general this does not affect the correctness of FLE, since C will eventually go back to the FOLLOWING state (A and B won't vote for C). However, it delays C from joining the cluster, which can in turn affect the recovery time of an application.
>
> Proposal: A and B should send only the latest (most recent) notification instead of the entire history. Does this sound reasonable?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
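The proposal at the end of the description can be sketched as a per-peer "last message" buffer in the connection layer. This is a hypothetical simplification for illustration only: the class and method names below are invented, and the actual patch modifies QuorumCnxManager's send path.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposal: instead of queueing every
// notification ever sent to a peer, remember only the most recent one
// and resend just that when the peer (re)connects. All names here are
// invented; the real fix lives inside QuorumCnxManager.
public class LastNotificationOutbox {

    // sid -> most recent notification payload sent to that peer
    private final Map<Long, byte[]> lastSent = new ConcurrentHashMap<>();

    // Called on every send: overwriting drops the history that would
    // otherwise be replayed to a joining peer.
    public void recordSend(long sid, byte[] notification) {
        lastSent.put(sid, notification);
    }

    // Called when a connection to `sid` is (re)established: at most one
    // message, the current vote, is resent.
    public Optional<byte[]> toResend(long sid) {
        return Optional.ofNullable(lastSent.get(sid));
    }

    public static void main(String[] args) {
        LastNotificationOutbox out = new LastNotificationOutbox();
        out.recordSend(3L, "vote: leader=1".getBytes());
        out.recordSend(3L, "vote: leader=2".getBytes());
        // Only the most recent vote survives for replay on reconnect.
        System.out.println(new String(out.toResend(3L).orElseThrow()));
    }
}
```

Under this scheme a joining peer C would see only A's and B's current votes for the established leader, rather than the stale self-votes from cluster formation that trigger the early exit in lookForLeader().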
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025877#comment-13025877 ]

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

According to the console output, the problem is not with the patch:

{noformat}
[javadoc] javadoc: warning - Error fetching URL: http://java.sun.com/javase/6/docs/api/package-list
{noformat}

so it is +1 for me. If no one has anything against it, I'll commit it later. Thanks, Vishal.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024920#comment-13024920 ]

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Does anyone know why we are getting this javadoc warning? ZOOKEEPER-1052 also got -1 on javadoc, and I'm not sure what the problem is. Any hint would be welcome.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024915#comment-13024915 ]

Vishal K commented on ZOOKEEPER-975:
------------------------------------

The findbugs failure here is due to ZOOKEEPER-1052, as pointed out by Flavio.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024913#comment-13024913 ]

Hadoop QA commented on ZOOKEEPER-975:
-------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12477322/ZOOKEEPER-975.patch6
against trunk revision 1091841.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 6 new or modified tests.
    -1 javadoc. The javadoc tool appears to have generated 1 warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/232//testReport/
Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/232//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/232//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019429#comment-13019429 ]

Hadoop QA commented on ZOOKEEPER-975:
-------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12476255/ZOOKEEPER-975.patch5
against trunk revision 1091841.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 6 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    -1 findbugs. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    +1 core tests. The patch passed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/226//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/226//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/226//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019118#comment-13019118 ]

Hadoop QA commented on ZOOKEEPER-975:
-------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12476190/ZOOKEEPER-975.patch4
against trunk revision 1091314.

    +1 @author. The patch does not contain any @author tags.
    +1 tests included. The patch appears to include 6 new or modified tests.
    +1 javadoc. The javadoc tool did not generate any warning messages.
    -1 javac. The patch appears to cause tar ant target to fail.
    -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.
    +1 release audit. The applied patch does not increase the total number of release audit warnings.
    -1 core tests. The patch failed core unit tests.
    +1 contrib tests. The patch passed contrib unit tests.

Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/225//testReport/
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/225//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017172#comment-13017172 ]

Flavio Junqueira commented on ZOOKEEPER-975:
--------------------------------------------

Hi Vishal,

What if we have a test with 3 servers with ids 1, 2, and 3? We start 1 and 2, and let 2 be elected. Once 2 is elected, we start 3. With the current trunk code, 3 should declare itself LEADING, since it will receive the initial notifications of the other two processes. With your patch, this situation shouldn't happen. What do you think?
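The scenario Flavio describes can be modeled without ZooKeeper itself. The sketch below is a toy simulation under stated simplifying assumptions (votes are compared only by zxid and server id, election epochs are ignored, and the running ensemble's current votes carry a higher zxid because it has committed transactions since server 3 was last up); it only illustrates why replaying the full notification history flips server 3 into LEADING, while replaying each peer's latest vote does not.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the bug scenario: servers 1 and 2 form a quorum and elect 2,
// then server 3 starts. We compare what 3 concludes when peers replay their
// full notification history versus only their most recent notification.
public class LatestNotificationDemo {

    record Note(long fromSid, long leader, long zxid) {}

    // Simplified totalOrderPredicate: higher zxid wins, ties broken by id.
    static boolean supersedes(long leader, long zxid, long curLeader, long curZxid) {
        return zxid > curZxid || (zxid == curZxid && leader > curLeader);
    }

    // Server `myId` proposes itself with `myZxid` and consumes replayed
    // notifications one at a time. As in lookForLeader(), the "received
    // from all nodes" exit is evaluated after every message, so stale
    // history can trigger it before the fresh votes are even read.
    static String join(long myId, long myZxid, int ensemble, List<Note> stream) {
        long propLeader = myId, propZxid = myZxid;
        Map<Long, Note> recvset = new HashMap<>();       // latest note per sender
        recvset.put(myId, new Note(myId, myId, myZxid)); // own self-vote
        for (Note n : stream) {
            if (supersedes(n.leader(), n.zxid(), propLeader, propZxid)) {
                propLeader = n.leader();
                propZxid = n.zxid();
            }
            recvset.put(n.fromSid(), n);
            if (recvset.size() == ensemble) {            // premature exit point
                return propLeader == myId ? "LEADING" : "FOLLOWING";
            }
        }
        return "LOOKING";
    }

    public static void main(String[] args) {
        // Full history: the initial self-votes from cluster formation come
        // first, the eventual votes for leader 2 come last.
        List<Note> fullHistory = List.of(
                new Note(1, 1, 0), new Note(2, 2, 0),
                new Note(1, 2, 200), new Note(2, 2, 200));
        // With the proposed fix, only each peer's most recent vote is resent.
        List<Note> latestOnly = List.of(
                new Note(1, 2, 200), new Note(2, 2, 200));

        System.out.println("full history: " + join(3, 100, 3, fullHistory)); // LEADING
        System.out.println("latest only:  " + join(3, 100, 3, latestOnly));  // FOLLOWING
    }
}
```

With the full history, server 3's recvset covers all three servers as soon as the two stale self-votes arrive, so the early exit fires and 3 declares itself LEADING; with latest-only replay, the first notes it sees are the current votes for server 2, and it goes straight to FOLLOWING.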
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015975#comment-13015975 ]

Vishal K commented on ZOOKEEPER-975:
------------------------------------

Hi Flavio,

Do you think we need a test for this? I was looking through the code to see how we can write one. What we can do is insert notifications into recvqueue for a peer, then call lookForLeader(), and monitor the state/proposedZxid/proposedLeader/etc. This would let us feed whatever notifications we want to FLE. The other peers should just ignore the notifications (or send the notifications that we want them to send). However, for this we would have to change FastLeaderElection so that one can overload its Messenger and modify recvqueue, set proposedLeader, proposedZxid, etc. from the test. I think this would be a good change in general, so that we can feed notifications to a peer and test corner cases, but it is a bit time consuming. I am not sure how much it would help for this particular bug, though. What do you think?

-Vishal
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014532#comment-13014532 ]

Flavio Junqueira commented on ZOOKEEPER-975:

Hi Vishal, Apart from not having a test, it is +1 for me. Looks good.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012666#comment-13012666 ]

Flavio Junqueira commented on ZOOKEEPER-975:

Hi Vishal,

In the run() method of WorkerReceiver we already perform a check:

{noformat}
/*
 * If it is from an observer, respond right away.
 * Note that the following predicate assumes that
 * if a server is not a follower, then it must be
 * an observer. If we ever have any other type of
 * learner in the future, we'll have to change the
 * way we check for observers.
 */
if(!self.getVotingView().containsKey(response.sid)){
    Vote current = self.getCurrentVote();
    ToSend notmsg = new ToSend(ToSend.mType.notification,
            current.id,
            current.zxid,
            logicalclock,
            self.getPeerState(),
            response.sid);
    sendqueue.offer(notmsg);
}
{noformat}

I'll be reviewing your patch shortly.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012598#comment-13012598 ]

Hadoop QA commented on ZOOKEEPER-975:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12474906/ZOOKEEPER-975.patch3
against trunk revision 1082362.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warning.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/209//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/209//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/209//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012196#comment-13012196 ]

Vishal K commented on ZOOKEEPER-975:

Hi Flavio,

I think a peer should check whether a received notification originated from a cluster member. This check is done for LOOKING notifications, but only after modifying proposedLeader, zxid, and epoch. The verification is not performed for LEADING/FOLLOWING notifications at all. I am planning to add the check at the very beginning:

{noformat}
else if (self.getVotingView().containsKey(n.sid)) {
    switch (n.state) {
    case LOOKING:
        [...]
    }
}
{noformat}

Do you see any issues with this (especially while using OBSERVER_ID)?
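The proposed guard can be illustrated with a small, self-contained model. The names below (votingView, Notification, shouldProcess) are hypothetical stand-ins — in ZooKeeper the membership test would be self.getVotingView().containsKey(n.sid) inside lookForLeader():

```java
import java.util.Set;

// Sketch of the membership check Vishal proposes: drop any notification
// whose sender is not in the voting view, before any election state
// (proposedLeader, zxid, epoch) is modified.
public class VotingViewFilter {
    record Notification(long sid, String state) {}

    // votingView holds the sids of voting members; observers and
    // outsiders are absent from it.
    static boolean shouldProcess(Set<Long> votingView, Notification n) {
        // Applies uniformly to LOOKING, LEADING, and FOLLOWING
        // notifications, which is the point of hoisting the check
        // above the switch on n.state.
        return votingView.contains(n.sid());
    }

    public static void main(String[] args) {
        Set<Long> view = Set.of(1L, 2L, 3L);
        System.out.println(shouldProcess(view, new Notification(2, "LEADING")));  // true
        System.out.println(shouldProcess(view, new Notification(9, "LOOKING"))); // false
    }
}
```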
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13012161#comment-13012161 ]

Vishal K commented on ZOOKEEPER-975:

Thanks for the clarification. I will reintroduce the separate data structure for LEADING/FOLLOWING notifications.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011639#comment-13011639 ]

Flavio Junqueira commented on ZOOKEEPER-975:

The problem I see with mixing notifications from LOOKING peers and FOLLOWING/LEADING peers is the following. Peers that are either LEADING or FOLLOWING won't change their leader state based on new notifications. If a peer receives notifications from a quorum and one of the notifications says FOLLOWING/LEADING, then the peer cannot count on the sender of that notification to move to a new leader. The election will fail and the peer (perhaps along with others) will have to start over.

Also, in the case a peer gets disconnected and comes back, if there is a quorum following a leader when the peer returns, it needs to be able to determine who the leader is even if its leader election round is higher than the others'. In this case, it has to follow a leader based on FOLLOWING/LEADING notifications, without comparing the votes in the notifications. For these reasons, we have kept FOLLOWING/LEADING notifications in a separate data structure. It is true, though, that we could keep all of them in the same structure and simply filter based on the state field when processing.
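Flavio's point about the separate structure can be sketched with a simplified model. This is an illustration only — quorumAgrees below is a stripped-down stand-in for ZooKeeper's termPredicate, and outofelection here is just a plain map:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of why LEADING/FOLLOWING votes live apart from LOOKING votes:
// a quorum of out-of-election votes for the same leader lets a rejoining
// peer follow that leader directly, without comparing zxids or rounds.
public class OutOfElectionSketch {
    record Vote(long leader, long zxid) {}

    // termPredicate, simplified: does a strict majority of the collected
    // votes agree exactly with `candidate`?
    static boolean quorumAgrees(Map<Long, Vote> votes, Vote candidate, int ensembleSize) {
        long count = votes.values().stream().filter(candidate::equals).count();
        return count > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int ensembleSize = 3;
        // Votes from peers that are already LEADING/FOLLOWING are kept
        // here, never mixed into the current election round's recvset.
        Map<Long, Vote> outofelection = new HashMap<>();
        outofelection.put(1L, new Vote(1, 7));
        outofelection.put(2L, new Vote(1, 7));
        Vote candidate = new Vote(1, 7);
        System.out.println(quorumAgrees(outofelection, candidate, ensembleSize)); // true
    }
}
```

Filtering on the state field within a single structure, as Flavio notes, would be functionally equivalent but was not the historical design choice.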
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011331#comment-13011331 ]

Vishal K commented on ZOOKEEPER-975:

Correction: "I wasn't sure about that change either"
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011330#comment-13011330 ]

Vishal K commented on ZOOKEEPER-975:

{quote}
1. It does not have a test, but I'm not entirely convinced we should try to implement one, since it might be complex. We should think about it, though;
{quote}

Yes, I didn't include a test. We should have a test, but I don't think I can get to it in the next couple of days or so. I was thinking of feeding recvqueue with dummy Notifications for the scenarios we want to test, then calling FastLeaderElection.lookForLeader() and verifying the outcome. The tricky part is to come up with the right sequence of Notifications to cover all the corner cases.

{quote}
2. I don't understand why you're removing the outofelection data structure. I believe the notifications from peers that are FOLLOWING/LEADING should be treated separately from the ones of peers that are LOOKING. If a peer obtains notifications from a quorum saying that the peers are LEADING/FOLLOWING, then it should follow the leader they point to. If a peer obtains notifications from a quorum of LOOKING peers, then it should follow the protocol to select a leader. Consequently, the notifications should be treated separately.
{quote}

I wasn't sure about that change either. I don't mind reintroducing outofelection, but I couldn't think of scenarios where inserting a Notification in recvset instead of outofelection would be a problem. Considering that termPredicate verifies that a majority votes for the same proposal, can you describe a scenario where this change could cause a problem?
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011244#comment-13011244 ]

Flavio Junqueira commented on ZOOKEEPER-975:

Hi Vishal,

Here are two high-level comments:

# It does not have a test, but I'm not entirely convinced we should try to implement one, since it might be complex. We should think about it, though;
# I don't understand why you're removing the outofelection data structure. I believe the notifications from peers that are FOLLOWING/LEADING should be treated separately from the ones of peers that are LOOKING. If a peer obtains notifications from a quorum saying that the peers are LEADING/FOLLOWING, then it should follow the leader they point to. If a peer obtains notifications from a quorum of LOOKING peers, then it should follow the protocol to select a leader. Consequently, the notifications should be treated separately.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010989#comment-13010989 ] Hadoop QA commented on ZOOKEEPER-975:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12474564/ZOOKEEPER-975.patch2
against trunk revision 1082362.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/199//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/199//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/199//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010689#comment-13010689 ] Flavio Junqueira commented on ZOOKEEPER-975:

Hi Vishal, I'll have a look at the patch once you generate a new one that applies. On the last comment you posted, I wonder why you think it would be better to update the last zxid according to the state of the server. One problem I see is that only the leader can really maintain its own last zxid up to date. The other servers don't really know how far the leader has gone. Also, the idea of maintaining the last vote intact is to keep the pair used to decide upon the current leader, and currently we don't use the zxid field to determine leadership if the notification says LEADING or FOLLOWING. Is there anything I'm missing?
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010522#comment-13010522 ] Vishal K commented on ZOOKEEPER-975:

Note there may be another problem in FLE unrelated to this bug. Originally, in QCM, instead of sending lastMessage I wanted to send a notification with the current state. But I noticed that a peer does not send its real current state in FastLeaderElection.java even when it is not LOOKING. Here's the relevant code:

{code}
/*
 * If this server is not looking, but the one that sent the ack
 * is looking, then send back what it believes to be the leader.
 */
Vote current = self.getCurrentVote();
if (ackstate == QuorumPeer.ServerState.LOOKING) {
    if (LOG.isDebugEnabled()) {
        LOG.debug("Sending new notification. My id = " + self.getId()
                + ", Recipient = " + response.sid
                + " zxid =" + current.zxid
                + " leader=" + current.id);
    }
    ToSend notmsg = new ToSend(
            ToSend.mType.notification,
            current.id,
            current.zxid,   // <-- zxid of the current vote
            logicalclock,
            self.getPeerState(),
            response.sid);
    sendqueue.offer(notmsg);
}
{code}

It sends the zxid of the current vote and not the current zxid seen by the server. The zxid of the vote can be stale. So I dropped that change from QCM and changed the size of sendQueue to 1. I think fixing the above code is a separate issue and should be done later, if needed.
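The stale-vote point above can be illustrated with a small sketch. PeerSketch and its fields are hypothetical stand-ins, not ZooKeeper's QuorumPeer API: the idea is simply that the zxid inside the vote is frozen when the election concludes, while the server's last processed zxid keeps advancing as transactions commit.

```java
// Sketch (hypothetical names, not the ZooKeeper API) of the gap between
// the zxid recorded in the current vote and the server's live zxid.
public class StaleVoteZxid {

    record Vote(long leaderId, long zxid) {}

    static final class PeerSketch {
        Vote currentVote;        // fixed when the last election concluded
        long lastProcessedZxid;  // advances with every committed txn

        PeerSketch(Vote v) {
            currentVote = v;
            lastProcessedZxid = v.zxid();
        }

        void commit(long zxid) {
            lastProcessedZxid = zxid;
        }

        // What the quoted code sends: the vote's zxid, not the live one.
        long notifiedZxid() {
            return currentVote.zxid();
        }
    }

    public static void main(String[] args) {
        PeerSketch follower = new PeerSketch(new Vote(1, 0x100));
        follower.commit(0x1f3);  // the cluster keeps committing after election

        System.out.println(follower.notifiedZxid());     // prints 256 (stale)
        System.out.println(follower.lastProcessedZxid);  // prints 499 (live)
    }
}
```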
[jira] [Commented] (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010520#comment-13010520 ] Hadoop QA commented on ZOOKEEPER-975:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12474451/ZOOKEEPER-975.patch
against trunk revision 1082362.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.

Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/195//console

This message is automatically generated.
Re: [jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
"Vishal K (JIRA)" wrote:

[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004273#comment-13004273 ] Vishal K commented on ZOOKEEPER-975:

Hi Flavio, I have a patch for this, but I have it on top of the fix for ZOOKEEPER-932. We have 932 applied to our ZK code since we need it. Until ZOOKEEPER-932 is reviewed and committed, I will have to keep back-porting patches (and do double testing). I will port my changes to trunk if someone requires a fix for the bug. Since this is not a blocker, I am going to hold off the patch until 932 is reviewed. That will reduce my testing and porting overhead. Does that sound ok? The patch I have is good only for FLE.

{quote} About maintenance, we have some time back talked about maintaining only the TCP version of FLE (FLE+QCM). There was never some real pressure to eliminate the others, and in fact previously some users were still using LE. I'm all for maintaining only FLE, but we need to hear the opinion of others. More thoughts? {quote}

The documentation says: "The implementations of leader election 1 and 2 are currently not supported, and we have the intention of deprecating them in the near future. Implementations 0 and 3 are currently supported, and we plan to keep supporting them in the near future. To avoid having to support multiple versions of leader election unecessarily, we may eventually consider deprecating algorithm 0 as well, but we will plan according to the needs of the community."

Is there a significant advantage of using LE 0 vs LE 3?
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993551#comment-12993551 ] Flavio Junqueira commented on ZOOKEEPER-975:

Hi Vishal, I'm under the impression that getting the current state only works for LEADING or FOLLOWING. While looking, we would need to return the latest vote, yes? Overall, it is sound to return the current state instead of keeping the latest message.

About maintenance, we have some time back talked about maintaining only the TCP version of FLE (FLE+QCM). There was never any real pressure to eliminate the others, and in fact previously some users were still using LE. I'm all for maintaining only FLE, but we need to hear the opinion of others. More thoughts?
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12993428#comment-12993428 ] Vishal K commented on ZOOKEEPER-975:

Hi Flavio, Do you agree that sending the current state is a better approach? I was looking at the code to add a method createNotificationMessage() that will return a notification message based on the current state of the server. QCM can call this method and send the notification message at the place where we send lastMessage. However, adding it to FastLeaderElection only won't be enough, since there are other leader election algorithms as well. The Election interface is very minimal. Should we add this method to this interface? Which leader election algorithms do we support, and is it ok to make this change only for the supported algorithms? Thanks -Vishal
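One way to read the createNotificationMessage() idea is sketched below. This is purely hypothetical: the real Election interface in org.apache.zookeeper.server.quorum is minimal and none of these nested types match it exactly; they only illustrate where such a hook could sit so that QCM stays algorithm-agnostic.

```java
// Hypothetical stand-ins, not the real org.apache.zookeeper.server.quorum types.
public class ElectionInterfaceSketch {

    record Vote(long leaderId, long zxid) {}
    enum ServerState { LOOKING, FOLLOWING, LEADING }
    record Notification(long leaderId, long zxid, ServerState state) {}

    // Simplified Election surface plus the proposed addition: QCM asks
    // the algorithm for a notification built from the peer's *current*
    // state instead of replaying whatever sits in its queues.
    interface Election {
        Vote lookForLeader();
        Notification createNotificationMessage();
    }

    // Minimal FLE-style implementation for illustration only.
    static final class FleSketch implements Election {
        private final Vote currentVote;
        private final ServerState state;

        FleSketch(Vote v, ServerState s) {
            currentVote = v;
            state = s;
        }

        @Override public Vote lookForLeader() {
            return currentVote;
        }

        @Override public Notification createNotificationMessage() {
            return new Notification(currentVote.leaderId(), currentVote.zxid(), state);
        }
    }

    public static void main(String[] args) {
        Election fle = new FleSketch(new Vote(1, 0x1f3), ServerState.FOLLOWING);
        // A joining peer asking this server gets its real state, not history.
        System.out.println(fle.createNotificationMessage().state()); // prints FOLLOWING
    }
}
```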
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991939#comment-12991939 ] Vishal K commented on ZOOKEEPER-975:

Hi Flavio, Re-adding lastMessage results in the delay reported in this bug. When node 2 attempts to join a cluster of nodes 1 and 0, nodes 1 and 0 send a "LOOKING" notification followed by a "FOLLOWING"/"LEADING" notification. After receiving the first pair of LOOKING notifications, 2 goes in the LEADING state. I think I have a better idea. Instead of sending lastMessage, how about we send the notification for the current state of the peer (just like we do in FastLeaderElection.java)? I think this will resolve both problems.
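Vishal's alternative can be sketched as follows. This is illustrative Java, not the actual patch; the class and method names are assumptions. The idea is that, on a fresh connection from a joining peer, the sender queues a notification derived from its current state rather than replaying history or a possibly stale lastMessage:

```java
// Sketch of the "send current state" idea (illustrative names, not the
// actual ZooKeeper patch): a formed-ensemble member answers a new
// connection with its present state and vote instead of old LOOKING
// messages, so the joiner learns who leads right away.
public class CurrentStateNotifier {
    enum ServerState { LOOKING, FOLLOWING, LEADING }

    private ServerState state = ServerState.FOLLOWING; // e.g. node 1 follows node 0
    private long currentVote = 0;                      // id of the leader we follow

    // Called when a joining peer (e.g. node 2) establishes a connection.
    String notificationForJoiner() {
        return "state=" + state + ", vote=" + currentVote;
    }

    public static void main(String[] args) {
        CurrentStateNotifier node1 = new CurrentStateNotifier();
        // Node 2 connects and immediately learns the formed ensemble's state
        // instead of receiving a replayed LOOKING notification.
        System.out.println(node1.notificationForJoiner());
    }
}
```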
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991486#comment-12991486 ] Flavio Junqueira commented on ZOOKEEPER-975: Hi Vishal, That sounds like a good way to patch it.
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991288#comment-12991288 ] Vishal K commented on ZOOKEEPER-975: Hi Flavio, OK, I will re-add lastMessage. How about, before sending lastMessage, we check whether the send queue is empty? If it is empty, we send lastMessage; otherwise, we send the message from the queue. This avoids sending stale messages to the peer. So the final fix will: 1. Change CAPACITY to 1. 2. Send lastMessage to A only if the send queue for A is empty. This ensures that we send the most recent notification while still handling lost messages as before. Does this sound ok?
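The capacity-1 part of the fix above can be modeled with a small sketch. This is an illustrative model, not the actual QuorumCnxManager patch; the class and method names are assumptions. A newer notification displaces an undelivered stale one, so a peer only ever receives the most recent vote:

```java
import java.util.concurrent.ArrayBlockingQueue;

// Sketch of a capacity-1 send queue (illustrative, not the actual QCM
// patch): queuing a new notification discards any stale, undelivered one,
// so the destination peer only ever sees the sender's latest vote.
public class LatestOnlyQueue {
    private final ArrayBlockingQueue<String> q = new ArrayBlockingQueue<>(1);

    // Keep only the latest message: drop whatever is still queued first.
    public synchronized void send(String notification) {
        q.poll();               // discard the stale, undelivered message (if any)
        q.offer(notification);  // cannot fail: the queue is now empty
    }

    // Non-blocking receive; returns null when nothing is queued.
    public String poll() {
        return q.poll();
    }

    public static void main(String[] args) {
        LatestOnlyQueue toPeerC = new LatestOnlyQueue();
        toPeerC.send("vote: leader=0, state=LOOKING");   // stale
        toPeerC.send("vote: leader=1, state=FOLLOWING"); // replaces it
        System.out.println(toPeerC.poll()); // only the latest survives
    }
}
```

The "send lastMessage only if the queue is empty" half of the proposal then follows naturally: a non-empty queue already holds something newer than lastMessage.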
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990955#comment-12990955 ] Flavio Junqueira commented on ZOOKEEPER-975: Hi Vishal, This is a general concurrency issue that arises when using TCP connections, and I'm not sure we can avoid it, even though our protocol is in principle resilient to such message losses. Back to the TCP issue I mentioned: we remove the message from the queue of messages to send and send it with a write call on the socket. However, it is not guaranteed that the message will get through, since the destination might drop the connection while the message is in transit. The sender has no way to know whether it really got through, unless we add an acknowledgment scheme on top of TCP, which sounds like overkill. The way we found to avoid this case, in one of the jiras I mentioned above, was to resend the last message the peer dequeued for a given destination. This is exactly the mechanism you're proposing to remove.
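The resend mechanism Flavio describes can be sketched as follows. The names here are illustrative, not the actual ZOOKEEPER-481 code: the sender remembers the last message dequeued per destination and resends it when the connection is re-established, because the preceding socket write may have been lost in transit:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the lastMessageSent safety net (illustrative names, not the
// actual ZOOKEEPER-481 code): record the last message dequeued for each
// destination before writing it to the socket, and resend it on reconnect,
// since the peer may have dropped the connection while it was in transit.
public class ResendOnReconnect {
    private final Map<Long, String> lastMessageSent = new HashMap<>();

    // Dequeue a message for sending; record it before the (lossy) write.
    String dequeueForSend(long dest, String msg) {
        lastMessageSent.put(dest, msg);
        return msg; // the actual socket write may silently fail
    }

    // On reconnect, the possibly-lost last message is resent first.
    String onReconnect(long dest) {
        return lastMessageSent.get(dest);
    }

    public static void main(String[] args) {
        ResendOnReconnect qcm = new ResendOnReconnect();
        qcm.dequeueForSend(2L, "vote: leader=1");
        // ... peer 2 drops the connection; the write may not have arrived ...
        System.out.println(qcm.onReconnect(2L));
    }
}
```

This keeps the protocol resilient to a lost write without an acknowledgment scheme on top of TCP, at the cost of occasionally redelivering a notification the peer already saw.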
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990064#comment-12990064 ] Vishal K commented on ZOOKEEPER-975: Hi Flavio, Can you describe this problem to me? I will check whether it still exists once this patch is applied on top of the patch for ZOOKEEPER-932. Thanks.
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987517#action_12987517 ] Vishal K commented on ZOOKEEPER-975: Hi Flavio, Do you think this will still be a problem after we have the patch for ZOOKEEPER-932? This is what ZOOKEEPER-475 describes: -- * Replica 1 sends a message to itself and to Replica 2 stating that its current vote is for replica 1; * Replica 2 sends a message to itself and to Replica 1 stating that its current vote is for replica 2; * Replica 1 updates its vote and sends a message to itself stating that its current vote is for replica 2; * Since replica 1 has two votes for 2 in an ensemble of 3 replicas, replica 1 decides to follow 2. The problem is that replica 2 does not receive a message from 1 stating that it changed its vote to 2, which prevents 2 from becoming a leader. Looking more carefully at why that happened, you can see that when 1 tries to send a message to 2, QuorumCnxManager in 1 is shutting down a connection to 2 at the same time that it is trying to open a new one. The incorrect synchronization prevents the creation of a new connection, and 1 and 2 end up not connected. -- We no longer have incorrect synchronization. We can have QCM in 1 shutting down the connection to 2 while it is trying to send a notification to 2. However, the only time 1 will shut down a connection to 2 is when it receives a new connection request from 2 (or when something is wrong with the connection). A new connection request is received when 2 is trying to send a notification to 1. As a result, 1 will end up sending a notification to 2 saying that it is following 2. Do you agree?
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987494#action_12987494 ] Flavio Junqueira commented on ZOOKEEPER-975: Hi Vishal, The main problem I see is that the patch you propose removes lastMessageSent (ZOOKEEPER-481), which was introduced to deal with a problem observed in ZOOKEEPER-475 and also discussed in ZOOKEEPER-480.
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984722#action_12984722 ] Vishal K commented on ZOOKEEPER-975: Hi Flavio, What is the motivation for sending the history of notifications to the joining peer? Shouldn't the most recent notification (or just the current state) be enough? I understand this is a performance issue; however, I think it is a sizeable hole. - There could have been multiple leader elections while the node was down, and the node could end up hopping across leaders until it reaches the correct one. - Suppose we have a 3-node cluster. I have a simple client that connects to A and creates znode_A to indicate that A (and the client) is online. Leader A disconnects from B and C, causing C to take leadership. Now, when A tries to rejoin the cluster, it can be unnecessarily delayed by this bug. If I have an application that takes some action when znode_A is unavailable, this bug can unnecessarily trigger that action. We are facing this problem in our application. I think it would be a small change to QCM. What do you think?
[jira] Commented: (ZOOKEEPER-975) new peer goes in LEADING state even if ensemble is online
[ https://issues.apache.org/jira/browse/ZOOKEEPER-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983331#action_12983331 ] Flavio Junqueira commented on ZOOKEEPER-975: Thanks for bringing this up, Vishal. This is not a new observation, although I can't remember whether we discussed it in a jira. In general, I'm lukewarm about this change. It is certainly not an issue to avoid the server going into LEADING before it correctly goes into LOOKING, but I'm not entirely comfortable with manipulating the queues of notifications. Having two servers concurrently thinking they are leading is a situation supported by our protocols, so such a modification would be an optimization to avoid the unnecessary LEADING step. Regarding application recovery time: we don't have a load-balancing scheme at this point (which could be quite useful), so bringing a new follower up does not guarantee that clients will move their sessions to the new follower. Note that this situation occurs only when there is an ensemble running and a server joins or recovers.