[ https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362117#comment-14362117 ]
Hadoop QA commented on ZOOKEEPER-1865:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12704626/ZOOKEEPER-1865-testfix.patch
  against trunk revision 1666764.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2569//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2569//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2569//console

This message is automatically generated.

> Fix retry logic in Learner.connectToLeader()
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-1865
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Thawan Kooburat
>            Assignee: Edward Carter
>             Fix For: 3.5.1
>
>         Attachments: ZOOKEEPER-1865-nanoTime.patch, ZOOKEEPER-1865-testfix.patch, ZOOKEEPER-1865.patch
>
>
> We discovered a long leader election time today in one of our prod ensembles. Here is the description of the event.
> Before the old leader goes down, it is able to announce a notification message. So 3 out of 5 (including the old leader) elected the old leader to be the new leader for the next epoch. While the old leader is being rebooted, 2 other machines are trying to connect to the old leader. So the quorum couldn't form until those 2 machines give up and move to the next round of leader election.
> This is because Learner.connectToLeader() uses simple retry logic. The contract for this method is that it should never spend longer than initLimit trying to connect to the leader. In our outage, each sock.connect() is probably blocked for initLimit and it is called 5 times.
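
The contract described in the issue (never spend longer than initLimit in total, no matter how many retries) can be illustrated with a deadline-based retry loop. The following is only a minimal sketch of that idea and is not the code from the attached patches: the class BoundedRetry, the Attempt interface, and the parameters budgetMs, maxTries and pauseMs are hypothetical names chosen for illustration. It assumes the caller converts the limit to milliseconds first (for example tickTime * initLimit).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

class BoundedRetry {
    /** Hypothetical hook: one connection attempt with a per-attempt timeout. */
    interface Attempt {
        void connect(int timeoutMs) throws IOException;
    }

    /**
     * Retries until success, until maxTries is reached, or until the overall
     * budget is spent. Returns the number of attempts that were needed.
     */
    static int connectWithinBudget(Attempt attempt, int budgetMs, int maxTries, long pauseMs)
            throws IOException, InterruptedException {
        // nanoTime is monotonic, so wall-clock adjustments cannot stretch the deadline.
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMs);
        IOException last = null;
        for (int tries = 1; tries <= maxTries; tries++) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                break; // budget spent; also avoids a timeout of 0, which means "wait forever"
            }
            try {
                // Each attempt may use only what is left, not the full budget again.
                attempt.connect((int) remainingMs);
                return tries;
            } catch (IOException e) {
                last = e;
            }
            long leftMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (leftMs <= 0) {
                break;
            }
            Thread.sleep(Math.min(pauseMs, leftMs)); // never sleep past the deadline
        }
        throw new IOException("could not connect within " + budgetMs + " ms", last);
    }

    /** Example wiring to a real socket; a fresh Socket is created for every attempt. */
    static Socket connectSocket(InetSocketAddress addr, int budgetMs)
            throws IOException, InterruptedException {
        Socket[] holder = new Socket[1];
        connectWithinBudget(timeoutMs -> {
            Socket s = new Socket();
            try {
                s.connect(addr, timeoutMs);
                holder[0] = s;
            } catch (IOException e) {
                s.close();
                throw e;
            }
        }, budgetMs, 5, 1000);
        return holder[0];
    }
}
```

With a loop like this, five attempts against an unreachable leader share one budget instead of each blocking for the full limit, so the total wait stays close to budgetMs rather than roughly five times that.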