[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055477#comment-14055477
 ] 

Hadoop QA commented on ZOOKEEPER-1865:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12654667/ZOOKEEPER-1865.patch
  against trunk revision 1608872.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The patch appears to cause tar ant target to fail.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

    +1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2181//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2181//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2181//console

This message is automatically generated.

> Fix retry logic in Learner.connectToLeader() 
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-1865
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Thawan Kooburat
>            Assignee: Edward Carter
>             Fix For: 3.5.0, 3.5.1
>
>         Attachments: ZOOKEEPER-1865.patch
>
>
> We discovered a long leader election time today in one of our prod ensemble.
> Here is the description of the event. 
> Before the old leader goes down, it is able to announce notification message. 
> So 3 out 5 (including the old leader) elected the old leader to be a new 
> leader for the next epoch. While, the old leader is being rebooted, 2 other 
> machines are trying to connect to the old leader.  So the quorum couldn't 
> form until those 2 machines give up and move to the next round of leader 
> election.
> This is because Learner.connectToLeader() use a simple retry logic. The 
> contract for this method is that it should never spend longer that initLimit 
> trying to connect to the leader.  In our outage, each sock.connect() is 
> probably blocked for initLimit and it is called 5 times.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to