[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2018-11-21 Thread Michael K. Edwards (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695365#comment-16695365
 ] 

Michael K. Edwards commented on ZOOKEEPER-1865:
---

Is this reproducible in current 3.5?

> Fix retry logic in Learner.connectToLeader()
> --------------------------------------------
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Reporter: Thawan Kooburat
> Assignee: Edward Carter
> Priority: Major
> Fix For: 3.6.0, 3.5.5
>
> Attachments: ZOOKEEPER-1865-nanoTime.patch, 
> ZOOKEEPER-1865-testfix.patch, ZOOKEEPER-1865.patch
>
>
> We discovered a long leader election time today in one of our prod ensembles.
> Here is a description of the event. 
> Before the old leader went down, it was able to announce a notification message. 
> So 3 out of 5 servers (including the old leader) elected the old leader to be 
> the new leader for the next epoch. While the old leader was being rebooted, the 
> 2 other machines kept trying to connect to it, so the quorum couldn't 
> form until those 2 machines gave up and moved to the next round of leader 
> election.
> This is because Learner.connectToLeader() uses simple retry logic. The 
> contract for this method is that it should never spend longer than initLimit 
> trying to connect to the leader.  In our outage, each sock.connect() 
> probably blocked for initLimit, and it is called 5 times.
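
The contract described above (never spend longer than initLimit in total, however many attempts are made) can be illustrated with a deadline-based loop. This is a minimal sketch under stated assumptions, not the attached patch: the class BoundedRetry, the Connector interface, and connectWithDeadline are hypothetical names, and budgetMs stands in for the time that initLimit allows. The key point is that each attempt receives only the time still remaining, so five attempts cannot add up to five full timeouts.

```java
import java.io.IOException;

class BoundedRetry {
    /** Hypothetical stand-in for sock.connect(addr, timeoutMs). */
    interface Connector {
        void connect(int timeoutMs) throws IOException;
    }

    /**
     * Tries to connect at most maxAttempts times while keeping the total
     * time under budgetMs. Returns the number of attempts used on success.
     * A real implementation would probably also pause between fast
     * failures; that is left out to keep the sketch short.
     */
    static int connectWithDeadline(Connector c, int maxAttempts, int budgetMs)
            throws IOException {
        long deadline = System.nanoTime() + budgetMs * 1_000_000L;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                break; // budget exhausted: stop retrying
            }
            try {
                // Each attempt may block only for the time that is left.
                c.connect((int) remainingMs);
                return attempt;
            } catch (IOException e) {
                last = e;
            }
        }
        throw new IOException("could not connect within " + budgetMs + " ms", last);
    }
}
```

With the simple retry logic from the description, the worst case is roughly 5 x initLimit; with a shared deadline it stays near a single initLimit.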



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-15 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362598#comment-14362598
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-1865:


It failed on trunk: https://builds.apache.org/job/ZooKeeper-trunk/2629/

The failure might not be related to this patch. I just ran the test without 
this patch and it still failed. It looks like the test is timing dependent. It 
fails if 2 snapshots don't finish within 200ms.



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-15 Thread Camille Fournier (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362592#comment-14362592
 ] 

Camille Fournier commented on ZOOKEEPER-1865:
-

[~michim] is this failing on trunk or precommit builds? Can you point to the 
build that you saw fail besides your local?



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-15 Thread Camille Fournier (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362585#comment-14362585
 ] 

Camille Fournier commented on ZOOKEEPER-1865:
-

Not sure, I can't get it to fail but let me look.



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362341#comment-14362341
 ] 

Hudson commented on ZOOKEEPER-1865:
---

FAILURE: Integrated in ZooKeeper-trunk #2629 (See 
[https://builds.apache.org/job/ZooKeeper-trunk/2629/])
ZOOKEEPER-1865 Fix retry logic in Learner.connectToLeader() (Edward Carter via 
michim) (michim: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1666784)
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java
* 
/zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/LearnerTest.java




[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-15 Thread Michi Mutsuzaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362282#comment-14362282
 ] 

Michi Mutsuzaki commented on ZOOKEEPER-1865:


+1



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362117#comment-14362117
 ] 

Hadoop QA commented on ZOOKEEPER-1865:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12704626/ZOOKEEPER-1865-testfix.patch
  against trunk revision 1666764.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2569//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2569//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2569//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362116#comment-14362116
 ] 

Hadoop QA commented on ZOOKEEPER-1865:
--

+1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12704626/ZOOKEEPER-1865-testfix.patch
  against trunk revision 1666764.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2568//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2568//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2568//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-14 Thread Camille Fournier (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362102#comment-14362102
 ] 

Camille Fournier commented on ZOOKEEPER-1865:
-

So, the test as written does not actually exhibit the error we're fixing; if we 
revert the meaningful change you've proposed to Learner it will still pass. I 
updated it a bit to get it to fail with the (mostly) old code (modulo some 
helper methods you wrote), and pass with the new code. Have attached. [~michim] 
if you have a chance to look at this quickly it would be nice to get this into 
3.5.1



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362090#comment-14362090
 ] 

Hadoop QA commented on ZOOKEEPER-1865:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12695662/ZOOKEEPER-1865-nanoTime.patch
  against trunk revision 1666760.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2566//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2566//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2566//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-03-14 Thread Camille Fournier (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362071#comment-14362071
 ] 

Camille Fournier commented on ZOOKEEPER-1865:
-

I'm going to retrigger a build. Can't believe this patch has been open for a 
year...



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-01-30 Thread Jared Cantwell (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299516#comment-14299516
 ] 

Jared Cantwell commented on ZOOKEEPER-1865:
---

Are the core unit tests stable?  I don't believe either of the failures was 
caused by my patch (there would be a specific new exception printed in the 
logs).



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-01-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14299499#comment-14299499
 ] 

Hadoop QA commented on ZOOKEEPER-1865:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12695662/ZOOKEEPER-1865-nanoTime.patch
  against trunk revision 1655910.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2499//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2499//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2499//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-01-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294516#comment-14294516
 ] 

Hadoop QA commented on ZOOKEEPER-1865:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12694892/ZOOKEEPER-1865-nanoTime-noUT.patch
  against trunk revision 1655082.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2496//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-01-27 Thread Jared Cantwell (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294503#comment-14294503
 ] 

Jared Cantwell commented on ZOOKEEPER-1865:
---

Camille, we didn't like the use of currentTimeMillis because it's not safe 
against time jumps and we've had problems with that in the past, so I'm 
thinking of polishing up the patch I just attached, which uses System.nanoTime 
instead.  What do you think of that approach?

Do you have suggestions for some good tests that can leverage the nanoTime 
overridable method without further poking into the internals of 
connectToLeader?  Or were you thinking we should use it in already existing 
tests?
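
One way to picture the "overridable nanoTime method" discussed here is a class that reads time only through a protected method, so a test can substitute a fake clock and never has to wait in real time. The sketch below is an illustration under assumptions, not the contents of the attached patch: ClockedRetry, attempt, and connect are hypothetical names, and attempt stands in for a single sock.connect() call.

```java
abstract class ClockedRetry {
    /** Overridable time source; behaves like System.nanoTime() by default. */
    protected long nanoTime() {
        return System.nanoTime();
    }

    /** Hypothetical single connection attempt; returns true on success. */
    protected abstract boolean attempt(int timeoutMs);

    /**
     * Keeps attempting until one succeeds or budgetMs has elapsed
     * according to nanoTime(). Returns true if a connection was made.
     */
    public boolean connect(int budgetMs) {
        long start = nanoTime();
        while (true) {
            long elapsedMs = (nanoTime() - start) / 1_000_000L;
            long remainingMs = budgetMs - elapsedMs;
            if (remainingMs <= 0) {
                return false; // invariant: never exceed the budget
            }
            if (attempt((int) remainingMs)) {
                return true;
            }
        }
    }
}
```

A test can subclass this, return a counter from nanoTime(), and advance the counter inside attempt() to simulate slow connects. Because System.nanoTime() is monotonic, the production path is also unaffected by wall-clock jumps, which is the concern raised about currentTimeMillis.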



[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-01-16 Thread Jared Cantwell (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14280916#comment-14280916
 ] 

Jared Cantwell commented on ZOOKEEPER-1865:
---

We just hit this in our internal testing today too.

connect() throws a SocketTimeoutException if the specified timeout was reached. 
 This isn't perfect, but could this be leveraged to assume connect was "fast" 
if that exception wasn't thrown, and it was "slow" otherwise?  Unfortunately, 
if connect() takes just under the timeout to throw a different error, then 
we'll "lose" that time.  Probably not ideal, but wanted to suggest it as an 
option.
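
The suggestion above can be sketched as follows: charge an attempt's full timeout against the budget only when connect() reports a SocketTimeoutException, and treat every other failure as "fast". This is a hedged illustration of the idea, not code from any attached patch; TimeoutAccounting, Connector, and the parameter names are hypothetical. The caveat from the comment is visible in the second catch branch, where a non-timeout error that arrives late is not charged at all.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class TimeoutAccounting {
    /** Hypothetical stand-in for sock.connect(addr, timeoutMs). */
    interface Connector {
        void connect(int timeoutMs) throws IOException;
    }

    /**
     * Tries up to maxAttempts times. A timed-out attempt is charged its
     * whole timeout against budgetMs; any other failure is charged nothing.
     * Returns true if a connection was made.
     */
    static boolean connect(Connector c, int maxAttempts, int perAttemptMs, int budgetMs) {
        int remainingMs = budgetMs;
        for (int i = 0; i < maxAttempts && remainingMs > 0; i++) {
            int timeoutMs = Math.min(perAttemptMs, remainingMs);
            try {
                c.connect(timeoutMs);
                return true;
            } catch (SocketTimeoutException e) {
                remainingMs -= timeoutMs; // "slow": the timeout was reached
            } catch (IOException e) {
                // "fast": assumed to have taken no time, which under-counts
                // errors that arrive just before the timeout
            }
        }
        return false;
    }
}
```

This avoids reading a clock altogether, at the cost of the lost time the comment mentions.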

> Fix retry logic in Learner.connectToLeader() 
> -
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Thawan Kooburat
>Assignee: Edward Carter
> Fix For: 3.5.1
>
> Attachments: ZOOKEEPER-1865.patch


[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-01-11 Thread Camille Fournier (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273122#comment-14273122
 ] 

Camille Fournier commented on ZOOKEEPER-1865:
-

I'm not super crazy jazzed with all the inline calls to 
System.currentTimeMillis tbh. Feels like it will be a nightmare to test. Why 
not make it an overridable method to check this invariant?
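
A minimal sketch of what such an overridable method could look like. This is an illustration only, not the patch's code; RetryClock, mark, and withinBudget are hypothetical names, and the sketch reads System.nanoTime() rather than System.currentTimeMillis().

```java
import java.util.concurrent.TimeUnit;

class RetryClock {
    /** Overridable time source; a test can substitute a fake value. */
    protected long nanoTime() {
        return System.nanoTime();
    }

    /** Captures a start mark from the time source. */
    public long mark() {
        return nanoTime();
    }

    /** True while fewer than budgetMs milliseconds have passed since startNanos. */
    public boolean withinBudget(long startNanos, long budgetMs) {
        return nanoTime() - startNanos < TimeUnit.MILLISECONDS.toNanos(budgetMs);
    }
}
```

A test can then subclass RetryClock, return a controlled value from nanoTime(), and check the invariant without sleeping.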

> Fix retry logic in Learner.connectToLeader() 
> -
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Thawan Kooburat
>Assignee: Edward Carter
> Fix For: 3.5.1
>
> Attachments: ZOOKEEPER-1865.patch


[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-01-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264368#comment-14264368
 ] 

Hadoop QA commented on ZOOKEEPER-1865:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12654667/ZOOKEEPER-1865.patch
  against trunk revision 1646992.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2475//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2475//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2475//console

This message is automatically generated.

> Fix retry logic in Learner.connectToLeader() 
> -
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Thawan Kooburat
>Assignee: Edward Carter
> Fix For: 3.5.1
>
> Attachments: ZOOKEEPER-1865.patch


[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2015-01-05 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264314#comment-14264314
 ] 

Rakesh R commented on ZOOKEEPER-1865:
-

bq. Are there plans to require Java 7 or later for Zookeeper in the near future?
[~ecarter], since ZOOKEEPER-1963 is in, we can go ahead with this. 
BTW, could you tell me the reason for (self.initLimit - self.syncLimit)? Also, 
is there a chance that self.syncLimit > self.initLimit, so that this evaluates 
to a negative integer?
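
To illustrate why the sign matters: Socket.connect(SocketAddress, int) throws IllegalArgumentException for a negative timeout and treats 0 as an infinite timeout. The sketch below shows one possible guard. It assumes the difference is in ticks and is multiplied by tickTime to get milliseconds; that assumption, the one-tick floor, and the names ConnectBudget and budgetMs are all hypothetical and not taken from the patch.

```java
final class ConnectBudget {
    /**
     * Returns a connect budget in milliseconds computed from
     * (initLimit - syncLimit) ticks, never less than one tick.
     */
    static int budgetMs(int tickTimeMs, int initLimit, int syncLimit) {
        // Assumption: floor at one tick so the result is never zero
        // (infinite timeout) or negative (IllegalArgumentException).
        int ticks = Math.max(initLimit - syncLimit, 1);
        return ticks * tickTimeMs;
    }
}
```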

> Fix retry logic in Learner.connectToLeader() 
> -
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Thawan Kooburat
>Assignee: Edward Carter
> Fix For: 3.5.1
>
> Attachments: ZOOKEEPER-1865.patch


[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2014-07-21 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069826#comment-14069826
 ] 

Patrick Hunt commented on ZOOKEEPER-1865:
-

Can we address this issue w/o jdk7?

> Fix retry logic in Learner.connectToLeader() 
> -
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Thawan Kooburat
>Assignee: Edward Carter
> Fix For: 3.5.0, 3.5.1
>
> Attachments: ZOOKEEPER-1865.patch



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2014-07-09 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056715#comment-14056715
 ] 

Flavio Junqueira commented on ZOOKEEPER-1865:
-

I don't think we have made plans to have such a requirement, but you can 
propose and see what others think. I don't object.

> Fix retry logic in Learner.connectToLeader() 
> -
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Thawan Kooburat
>Assignee: Edward Carter
> Fix For: 3.5.0, 3.5.1
>
> Attachments: ZOOKEEPER-1865.patch


[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2014-07-09 Thread Edward Carter (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056672#comment-14056672
 ] 

Edward Carter commented on ZOOKEEPER-1865:
--

The new unit test in the patch failed to build on Jenkins because it uses 
InetAddress.getLoopbackAddress(), which was introduced in Java 7.  Are there 
plans to require Java 7 or later for Zookeeper in the near future?
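
One way the test might avoid the Java 7 dependency, sketched under the assumption that only a loopback address is needed: InetAddress.getByName(null) is documented to return an address of the loopback interface and predates Java 7. The Loopback class and its get() method are hypothetical names for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class Loopback {
    /** Returns a loopback address without InetAddress.getLoopbackAddress(). */
    static InetAddress get() {
        try {
            // A null host is documented to resolve to the loopback interface.
            return InetAddress.getByName(null);
        } catch (UnknownHostException e) {
            // Not expected for a null host; surfaced as an unchecked error.
            throw new IllegalStateException("loopback lookup failed", e);
        }
    }
}
```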

> Fix retry logic in Learner.connectToLeader() 
> -
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Thawan Kooburat
>Assignee: Edward Carter
> Fix For: 3.5.0, 3.5.1
>
> Attachments: ZOOKEEPER-1865.patch


[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

2014-07-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055477#comment-14055477
 ] 

Hadoop QA commented on ZOOKEEPER-1865:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12654667/ZOOKEEPER-1865.patch
  against trunk revision 1608872.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The patch appears to cause tar ant target to fail.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2181//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2181//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2181//console

This message is automatically generated.

> Fix retry logic in Learner.connectToLeader() 
> -
>
> Key: ZOOKEEPER-1865
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Thawan Kooburat
>Assignee: Edward Carter
> Fix For: 3.5.0, 3.5.1
>
> Attachments: ZOOKEEPER-1865.patch