[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-02-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859061#comment-15859061
 ] 

Hudson commented on ZOOKEEPER-2080:
---

SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #3269 (See 
[https://builds.apache.org/job/ZooKeeper-trunk/3269/])
ZOOKEEPER-2080: Fix deadlock in dynamic reconfiguration (rakeshr: rev 
434a7eef271fab02306bcc9c8ad29ec2fe2e)
* (edit) src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java
* (edit) src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-02-08 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859052#comment-15859052
 ] 

Rakesh R commented on ZOOKEEPER-2080:
-

Thanks [~hanm] for the good work and others for the useful review comments.

Committed to branch-3.5:
https://git-wip-us.apache.org/repos/asf?p=zookeeper.git;a=commit;h=ac864f53bf7d1f3c8630348ed4cdeb2539ec3932

Committed to master:
https://git-wip-us.apache.org/repos/asf?p=zookeeper.git;a=commit;h=434a7eef271fab02306bcc9c8ad29ec2fe2e

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859025#comment-15859025
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2080:
---

Github user asfgit closed the pull request at:

https://github.com/apache/zookeeper/pull/92


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-02-08 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858792#comment-15858792
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

This looks good, thanks Michael. I'm still +1.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-02-08 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858789#comment-15858789
 ] 

Michael Han commented on ZOOKEEPER-2080:


Thanks Alex. I made a small change to address your last comment:
bq. Instead of passing lastSeenQV as a parameter, could you add a line here 
Map lastProposedView = 
self.getLastSeenQuorumVerifier().getAllMembers();

and the change is here fyi. 
https://github.com/apache/zookeeper/pull/92/commits/25f0caf15a0992b3e871f1fa5c7532f78c949037

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-02-08 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858783#comment-15858783
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

Hi Michael, Nothing changed in the code since my last comment right? so I'm 
still +1.


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858773#comment-15858773
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2080:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/92
  
Hey @shralex, want to close the loop here. Does the latest pull request 
look good to you? My stress tests on this branch has been green for past two 
weeks.



> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-01-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834049#comment-15834049
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2080:
---

Github user shralex commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/92#discussion_r97266698
  
--- Diff: 
src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java ---
@@ -468,31 +469,33 @@ synchronized private boolean connectOne(long sid, 
InetSocketAddress electionAddr
  */
 
 synchronized void connectOne(long sid){
+connectOne(sid, self.getLastSeenQuorumVerifier());
+}
+
+synchronized void connectOne(long sid, QuorumVerifier lastSeenQV){
 if (senderWorkerMap.get(sid) != null) {
- LOG.debug("There is a connection already for server " + sid);
- return;
+LOG.debug("There is a connection already for server " + sid);
+return;
 }
-synchronized(self) {
-   boolean knownId = false;
-// Resolve hostname for the remote server before attempting to
-// connect in case the underlying ip address has changed.
-self.recreateSocketAddresses(sid);
-if (self.getView().containsKey(sid)) {
-   knownId = true;
-if (connectOne(sid, self.getView().get(sid).electionAddr))
-   return;
-} 
-if (self.getLastSeenQuorumVerifier()!=null && 
self.getLastSeenQuorumVerifier().getAllMembers().containsKey(sid)
-   && (!knownId || 
(self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr !=
-   self.getView().get(sid).electionAddr))) {
-   knownId = true;
-if (connectOne(sid, 
self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr))
-   return;
-} 
-if (!knownId) {
-LOG.warn("Invalid server id: " + sid);
+boolean knownId = false;
+// Resolve hostname for the remote server before attempting to
+// connect in case the underlying ip address has changed.
+self.recreateSocketAddresses(sid);
+if (self.getView().containsKey(sid)) {
--- End diff --

How about passing also the last committed view so you don't need to call 
getView() multiple times ?
I know you're protecting this with the lock in QuorumPeer, but before I 
read the other file I thought there may be a race because of multiple accesses 
to the config. A comment would help here.




> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-01-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830406#comment-15830406
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2080:
---

Github user eribeiro commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/92#discussion_r96929565
  
--- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java 
---
@@ -434,6 +434,10 @@ public int getQuorumSize(){
 //last proposed quorum verifier
 public QuorumVerifier lastSeenQuorumVerifier = null;
 
+// Lock object that guard access to quorumVerifier and 
lastSeenQuorumVerifier.
+private byte[] qvLock = new byte[0];
--- End diff --

It should be *final* if you want to use it as a lock object. Also, as 
constant it would be renamed as *QV_LOCK*.

nit: Also, we usually use a `Object`:
```
private final Object LOCK = new Object();
```
or a `ReentrantReadWriteLock` (more verbose tough, with try-finally, etc).


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-01-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830354#comment-15830354
 ] 

Hadoop QA commented on ZOOKEEPER-2080:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12834574/ZOOKEEPER-2080.patch
  against trunk revision 42c75b5f2457f8ea5b4106ce5dc1c34c330361c0.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3563//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3563//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3563//console

This message is automatically generated.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-01-19 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830331#comment-15830331
 ] 

Michael Han commented on ZOOKEEPER-2080:


Update... current status of this issue:
* Rebased against latest master. 
* My stress tests on failed unit tests all pass.
* Code reviewed by @lvfangmin on github.

[~shralex][~fpj] Could you do a review on [Pull Request 
92|https://github.com/apache/zookeeper/pull/92] when you have a chance? 

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2017-01-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830313#comment-15830313
 ] 

Hadoop QA commented on ZOOKEEPER-2080:
--

+1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/225//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/225//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/225//console

This message is automatically generated.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-12-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722843#comment-15722843
 ] 

Hadoop QA commented on ZOOKEEPER-2080:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/100//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/100//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/100//console

This message is automatically generated.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-12-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722835#comment-15722835
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2080:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/92#discussion_r90916712
  
--- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java 
---
@@ -1390,24 +1406,29 @@ public QuorumVerifier configFromString(String s) 
throws IOException, ConfigExcep
 }
 
 /**
- * Return QuorumVerifier object for the last committed configuration
+ * Return QuorumVerifier object for the last committed configuration.
  */
-
-public synchronized QuorumVerifier getQuorumVerifier(){
-return quorumVerifier;
-
+public QuorumVerifier getQuorumVerifier(){
+synchronized (qvLock) {
+return quorumVerifier;
+}
 }
 
-public synchronized QuorumVerifier getLastSeenQuorumVerifier(){
-return lastSeenQuorumVerifier;
+/**
+ * Return QuorumVerifier object for the last proposed configuration.
+ */
+public QuorumVerifier getLastSeenQuorumVerifier(){
+synchronized (qvLock) {
+return lastSeenQuorumVerifier;
+}
 }
 
-public synchronized void connectNewPeers(){
-   if (qcm!=null && getQuorumVerifier()!=null && 
getLastSeenQuorumVerifier()!=null) {
-   Map committedView = 
getQuorumVerifier().getAllMembers();
-   for (Entry e: 
getLastSeenQuorumVerifier().getAllMembers().entrySet()){
+private void connectNewPeers(){
--- End diff --

@lvfangmin Good suggestion, PR updated. Thanks for double checking. 


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-12-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718771#comment-15718771
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2080:
---

Github user lvfangmin commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/92#discussion_r90764675
  
--- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java 
---
@@ -1390,24 +1406,29 @@ public QuorumVerifier configFromString(String s) 
throws IOException, ConfigExcep
 }
 
 /**
- * Return QuorumVerifier object for the last committed configuration
+ * Return QuorumVerifier object for the last committed configuration.
  */
-
-public synchronized QuorumVerifier getQuorumVerifier(){
-return quorumVerifier;
-
+public QuorumVerifier getQuorumVerifier(){
+synchronized (qvLock) {
+return quorumVerifier;
+}
 }
 
-public synchronized QuorumVerifier getLastSeenQuorumVerifier(){
-return lastSeenQuorumVerifier;
+/**
+ * Return QuorumVerifier object for the last proposed configuration.
+ */
+public QuorumVerifier getLastSeenQuorumVerifier(){
+synchronized (qvLock) {
+return lastSeenQuorumVerifier;
+}
 }
 
-public synchronized void connectNewPeers(){
-   if (qcm!=null && getQuorumVerifier()!=null && 
getLastSeenQuorumVerifier()!=null) {
-   Map committedView = 
getQuorumVerifier().getAllMembers();
-   for (Entry e: 
getLastSeenQuorumVerifier().getAllMembers().entrySet()){
+private void connectNewPeers(){
--- End diff --

Hi @hanm, I've followed the long discussion on the Jira, thanks for digging 
deeper into the problem. LGTM, only a small suggestion: it's error-prone to 
assume that in the future this method won't be called by other non-synchronized 
methods, we'd better to add the synchronize(qvLock) here too.


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-27 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613237#comment-15613237
 ] 

Ted Yu commented on ZOOKEEPER-2080:
---

Thanks for the effort, Michael.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593437#comment-15593437
 ] 

Hadoop QA commented on ZOOKEEPER-2080:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12834574/ZOOKEEPER-2080.patch
  against trunk revision cef5978969bedfe066f903834a9ea4af6d508844.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//console

This message is automatically generated.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-20 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593405#comment-15593405
 ] 

Michael Han commented on ZOOKEEPER-2080:


I've also left some comments on PR about the refactoring. The patch is stress 
tested on internal Jenkins with the previous failed tests, and so far all tests 
passed (with 500+ runs).

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593386#comment-15593386
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2080:
---

GitHub user hanm opened a pull request:

https://github.com/apache/zookeeper/pull/92

ZOOKEEPER-2080: Fix deadlock in dynamic reconfiguration.

Use explicit fine grained locks for synchronizing access to QuorumVerifier 
states in QuorumPeer.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hanm/zookeeper ZOOKEEPER-2080

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/92.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #92


commit cfb4d73a94f23c6e047f75d400372caa495f197f
Author: Michael Han 
Date:   2016-10-20T22:37:22Z

Use explicit fine grained lock for maintaining a consistent global view of 
QuorumVerifier in QuorumPeer.




> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-20 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593380#comment-15593380
 ] 

Michael Han commented on ZOOKEEPER-2080:


Getting back on this issue. I have another perspective - rather than 
synchronizing on entire QuorumPeer we could use fine grained locks to 
specifically guard the set of states affiliated with QuorumVerifier, namely:
* Quorum verifier.
* Last seen quorum verifier.
* Some additional fields that appertain to QV and last QV, such as quorum 
address, election port, and client address.

As for what to guard, I feel guard on individual fields (QV, last QV, quorum 
address, etc) is enough after examining the code paths and call graphs - if I 
am wrong, then we will need to figure out a compound state and invent a new 
class that encapsulates all these states. Also we should make sure that every 
access to these states will be guarded by the same lock. 

bq. But this update would have to synchronize on both QP and QCM, so not sure 
if the same problem exists as above.
I feel the same way - this solution just moves the lock around =).

bq. Do we need to update all view listeners in a critical section?
I think yes, if we want to maintain a global consistent view of the QV states. 
When the value of QV in QP is updated, QP needs to fire events to notify 
subscribers of such change, and without locking when the event hits subscriber 
the view of the world might already changed. We might also need locking on the 
event callbacks of each listener, to avoid any potential race condition of each 
listener. But this case the likely hood of running into race / dead lock sounds 
lower as we are not synchronizing on a single QP anymore.

In any cases, I feel refactoring the QuorumPeer to use fine grained locking is 
necessary. Attaching a patch to express my idea.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-27 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526260#comment-15526260
 ] 

Flavio Junqueira commented on ZOOKEEPER-2080:
-

The only reason I had to suggest that we use this listener pattern is that we 
apparently need the view across a number of classes, so instead of having to 
have QP keeping direct references to each of these classes, we simply keep 
implementations of some {{ViewListener}} interface. It is unclear to me whether 
we need to synchronize on QP when invoking the interface call on each of the 
registered instances. Do we need to update all view listeners in a critical 
section?

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-26 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523711#comment-15523711
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

I see. Maybe in this case Flavio's suggestion would work, or even a simpler 
method where you keep a copy of the views in QCM and update them from the 
QuorumPeer methods that set the last committed and last proposed views. But 
this update would have to synchronize on both QP and QCM, so not sure if the 
same problem exists as above.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-26 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523638#comment-15523638
 ] 

Michael Han commented on ZOOKEEPER-2080:


Hi Alex, the constraint here is that we can't invoke any of the QuroumPeer's 
synchronized functions inside the listener thread of QuorumCnxManager, because 
when we restart leader election, the same QuorumPeer object is acquired first 
to kick off the restart, and the lock would not be released until restart 
finished - and the restart would not finish until listener thread joins. Due to 
timing we could end up in any places inside listener's run method, so if we try 
to synchronize on the same QuorumPeer we would get blocked indefinitely. 

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-26 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523553#comment-15523553
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

I agree about the cause, but my suggestion (above) may be easier - create one 
synchronized method that returns both committed and proposed configs together 
(though an unmodifiable collection sounds like a good idea).

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-24 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519026#comment-15519026
 ] 

Flavio Junqueira commented on ZOOKEEPER-2080:
-

I suspect that {{connectOne}} needs to synchronize on self because it needs a 
consistent view of {{self.getView()}} and {{self.getLastSeenQuorumVerifier()}}. 
In fact, one interesting thing is that {{self.getView()}} is declared as:

{noformat}
public Map getView() {
return Collections.unmodifiableMap(getQuorumVerifier().getAllMembers());
}
{noformat}

So all we really need is {{self.getLastSeenQuorumVerifier()}}.

The root cause of all these deadlocks seems to be that we are trying to get a 
consistent view of the ensemble and locking {{QuorumPeer}} to guarantee 
consistency. The complex interdependencies across classes is making it 
difficult to guarantee that we don't have deadlocks. My suggestion is that we 
take a different approach. Each class that needs a consistent view of 
{{self.getLastSeenQuorumVerifier()}} will implement a listener that caches the 
new value locally, and {{QuorumPeer}} will broadcast changes to the quorum 
verifier to all listeners. Broadcasting can be done under a lock to prevent 
races with other operations inside {{QuorumPeer}}. I think that if we do 
something like this, we will be avoiding the circular dependencies and fixing 
the deadlocks. The change doesn't seem to be super complex, but I could be 
wrong, though.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-23 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517870#comment-15517870
 ] 

Michael Han commented on ZOOKEEPER-2080:


Thanks a ton for your feedback Alex, and Flavio. Had more thoughts about this 
today. First to reply your comments posted earlier:

bq. If we pass the QV instance as part of the connectOne call, then can we 
remove the synchronized(self) block?
To pass QV instances we have to first retrieve them from a QuorumPeer object. 
Get QV instances on a QuorumPeer object requires synchronization on the 
QuorumPeer itself, either using existing methods or using what Alex was 
suggesting:
bq. To remove the synchronized block completely you could create a synchronized 
function in QuorumPeer that returns a pair

So in any case we will still possibly end up with a deadlock somewhere inside 
listener threads - we just moved the deadlock from one place to the other :) - 
because somewhere in the listener thread of QuorumCnxManager we are going to 
invoke synchronized version of getQV, which might block because the same 
QuorumPeer object which the getQV methods are invoking were acquired as a lock 
during restart of FLE.

To recap the deadlock problem - a simplified version description:
* Restart leader election acquires a QuorumPeer lock.
* Restart leader election requires shut down QuorumCnxManager as part of the 
process while holding the lock.
* Shutdown QuorumCnxManager requires synchronize on the same QuorumPeer.

The solution for each of these in order:
* Don't acquire a QuorumPeer lock when restart LE. This is less likely a 
solution due to the state changes to QuorumPeer during the process required the 
process has to be synchronized.
* Don't hold the lock while shutdown QuorumCnxManager. This is essentially the 
latest patch did by moving the shutdown out of the sync block. However, it does 
not fix the issue I spotted in previous comments (potential dead lock in 
QuorumPeer.restartLeaderElection - because this method synchronized implicitly 
on the intrinsic lock). 
* Don't synchronize on the QuorumPeer inside QuorumCnxManager's shutdown 
process: this implies we can't involve any of the QuorumPeer's synchronized 
methods in listener threads run loop - otherwise we will end up a deadlock. As 
previously commented, removing {{synchronized(self)}} block by passing QV 
instance to {{connectOne}} would not work because we still need to invoke the 
synchronized version of getQA somewhere in listener threads.

Other thoughts - we could possibly interrupt the connectOne which is 
synchronizing on the QuorumPeer. It should be fine to interrupt because at this 
point everything is shutting down and the server socket was already closed (in 
listener.halt) so no resource leak concerns. So we just need to find a way that 
after we call {{listener.halt()}} in {{QuorumCnxManager.halt()}}, we can signal 
the listener threads to bail out the blocking state - I need to think more 
about how to do this because it seems not possible to interrupt a synchronized 
code block - so we may have to use additional abstractions (i.e. 
ReentrantLock?) which adds more complexity.

 



> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-19 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504624#comment-15504624
 ] 

Flavio Junqueira commented on ZOOKEEPER-2080:
-

Here is a thought. {{connectOne}} is synchronized both on the 
{{QuorumCnxManager}} object and {{self}}, which is a {{QuorumPeer}} instance. 
If we pass the QV instance as part of the {{connectOne}} call, then can we 
remove the {{synchronized(self)}} block?

I think the method is synchronized because of accesses such as the one to 
{{sendWorkerMap}}.

cc [~hanm] [~shralex]



> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-15 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495437#comment-15495437
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

If I'm not mistaken, when the configuration changes we create a completely new 
QuorumVerifier object, so if you call the new function and have a reference to 
the old QV, the information in that object shouldn't change. But please make 
sure that this is the case and document this assumption if you choose to use 
this method.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-15 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495429#comment-15495429
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

There is a lot of accesses to the config info from this function, and the 
synchronization may be needed
to make sure all of them return a state which is consistent. getView returns 
the last committed
config and getLastSeenQuorumVerifier returns the last proposed config. If we 
wouldn't be in a synchronized block, and the committed view changed (reconfig 
commits while we're in this function), it seems possible that sid
moves from last proposed to last committed config and we will never try to 
connect to it.

First, getLastSeenQuorumVerifier and getView shouldn't be accessed more than 
once (currently 4 and 3 times each).
We can just store the result and use that. 

To remove the synchronized block completely you could create a synchronized 
function in QuorumPeer that returns a pair: last committed and last proposed 
view, and call that function once from connectOne.


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-15 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494088#comment-15494088
 ] 

Michael Han commented on ZOOKEEPER-2080:


Regarding [~fpj]'s comment on 
bq. I'm wondering why QuorumCnxManager.connectOne is synchronized on self, all 
it is trying to do is to connect to another server.

The synchronization is introduced as part of ZOOKEEPER-107 
([reference|https://github.com/apache/zookeeper/commit/675ef85ba5cd97d69b875d14854b638da6f4423c#diff-1e15b8a38445193d635ac681fe37df67R409).
 I think the synchronization is required because this code check and react to 
state of QuorumPeer thus needs to synchronize with self as a whole. If we can 
get ride of the synchronization here, then both this JIRA and the potential 
deadlock spotted at {{QuorumPeer.restartLeaderElection}} can be fixed. I am not 
sure if that is feasible though.

[~shralex]: Could you shed some lights on the motivation of synchronizing on 
self in QuorumCnxManager.connectOne and also, any possibilities we can get ride 
of the synchronization here?

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-15 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494073#comment-15494073
 ] 

Michael Han commented on ZOOKEEPER-2080:


bq. The server will remain in the looking state, which will force it to try 
again. Do you agree?

Yes, the explanation makes sense to me. Previously I was not sure about this as 
I was concerning if skipping this will cause the following {{setCurrentVote}} 
fail, but now it looks like everything will converge. 

bq. I think this is the case I'm mentioning above, but let me know.

Agreed.

bq. A refactor of this code makes perfect sense to me.
Out of scope of this JIRA, but notes are taken :)

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-14 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490982#comment-15490982
 ] 

Flavio Junqueira commented on ZOOKEEPER-2080:
-

On the race involving {{shuttingDownLE}}, say that FLE sets it to true right 
after the {{QuorumPeer}} thread has checked that it is false here:

{noformat}
if (shuttingDownLE) {
 shuttingDownLE = false;
 startLeaderElection();
}
setCurrentVote(makeLEStrategy().lookForLeader());

{noformat}

The quorum peer thread will skip starting leader election, and when it calls 
{{lookForLeader}}, we might end up getting an empty vote because the FLE 
instance is stopped. This doesn't look like it is fatal, though. The server 
will remain in the looking state, which will force it to try again. Do you 
agree?

bq. Do we need to synchronize on this code in QuorumPeer? 

I think this is the case I'm mentioning above, but let me know.

bq. I think we have another potential dead lock in 
QuorumPeer.restartLeaderElection

Sounds right, good catch. I'm wondering why {{QuorumCnxManager.connectOne}} is 
synchronized on self, all it is trying to do is to connect to another server.

bq. While looking through code, I find it is hard to reason about current 
behavior due to the mixed state changes that could be done by many parties.

A refactor of this code makes perfect sense to me. {{QuorumPeer}} and 
{{QuorumCnxManager}}} could highly benefit from a redesign. FLE could also 
benefit from a refactor, but I find it better encapsulated because in principle 
all we need is to call {{lookForLeader}} and let it send and receive messages.

One thing that would be awesome to do with respect to leader election is to 
remove the need of an additional port. It would be great if we had a single 
port for all server communication.

Sorry about the daydreaming, you got me going...

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489395#comment-15489395
 ] 

Hadoop QA commented on ZOOKEEPER-2080:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12828380/ZOOKEEPER-2080.patch
  against trunk revision b2a484cfe743116d2531fe5d1e1d78b3960c511e.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3427//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3427//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3427//console

This message is automatically generated.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-11 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15482766#comment-15482766
 ] 

Michael Han commented on ZOOKEEPER-2080:


bq. The shut down of leader election doesn't need to be in the synchronized 
block.
This is what the patch did, by moving the shut down of LE out of the sync 
block. 

bq. it should be volatile.
Yes, agreed. Will update patch.

bq. is it possible that due to a race we miss that we are supposed to start LE? 
Possible in theory - an example:
* shuttingDownLE is set to true in 
[FastLeaderElection|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L302]
* There are a couple of other places in Leader / Follower where 
[processReconfig|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L302]
 will be called with restart LE flag set to true, which will set the 
shuttingDownLE flag to false.
* So in theory, between we set shuttingDownLE to true in 
[FastLeaderElection|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L302]
 and we ran into check the same flag in 
[QuorumPeer|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1066],
 there could be concurrent operations which involves processReconfig that 
change the state of shuttingDownLE from true to false. As a result, we miss the 
event of the state change of shuttingDownLE in QuorumPeer's main loop.

Now does it matter if we miss such event due to interleaving operations from 
processReconfig set shuttingDownLE from true to false? I am not sure, as I am 
still learning the semantic of reconfig and FLE to make a better guess here. 

Other notes:
* Do we need to synchronize on this code in QuorumPeer? I think probably we 
should, because similar 'test and set' operation around shuttingDownLE in 
QuorumPeer (such as 
[restartLeaderElection|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1419]
 is synchronized, which makes sense (as typically in multithreading, what we 
need to protected is not just the signal variable but also the executions that 
depends on the value of the variable).

* I think we have another potential dead lock in 
[QuorumPeer.restartLeaderElection|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1415].
 Here the method is synchronized, but part of its execution requires invokes of 
[FastLeaderElection.shutdown|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1418],
 which, as previously analyzed could end up at a place in 
[QuorumCnxManager|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L475]
 where it requires obtain same QuorumPeer lock. I don't think we are shot by 
this yet, could either because we don't have test case cover such case, or 
could because that is just a theory that in practice we will never end up with 
such code execution flow.

* While looking through code, I find it is hard to reason about current 
behavior due to the mixed state changes that could be done by many parties. I 
am thinking if there is better ways of doing this, for example using an actor 
based  mode to restructure the code such that we have an event driven 
architecture where each object has a main thread which only dispatch events and 
entities (such as QuorumPeer and FastLeaderElection) will communicate each 
other by passing event messages - there would not be any shared state and state 
changes for each entities will then be easier to reason about. I think I will 
have a better evaluation after more investigation around FLE codebase. 

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurr

[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-09 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15478360#comment-15478360
 ] 

Flavio Junqueira commented on ZOOKEEPER-2080:
-

I thought a bit more about this. My understanding is that all operations 
related to the current quorum verifier need to be in the synchronized block. 
The shut down of leader election doesn't need to be in the synchronized block.

I don't think "shuttingDownLE" needs to be synchronized with self, but given 
that it is accessed by multiple threads, it should be volatile.

"shuttingDownLE" is a bit misleading because QP uses it to determine that it 
needs to start leader election, not that shut down is in progress. One thing 
that isn't entirely clear to me is whether QP can miss that "shuttingDownLE" is 
supposed to be true due to a race. Specifically, 
[https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1082
 | here] is it possible that due to a race we miss that we are supposed to 
start LE? I think so because it doesn't seem to be synchronized.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-08 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475420#comment-15475420
 ] 

Michael Han commented on ZOOKEEPER-2080:


bq. First, it is in general better to keep such synchronized blocks as short as 
possible. 
Absolutely agree. 
bq. Second, shuttingDownLE is being used without synchronization in other parts 
of the code, so either that is broken or we don't need it in the block.
This was noticed earlier as Alex commented:
{noformat}
Quorum peer does this:
if (shuttingDownLE)
{ shuttingDownLE = false; startLeaderElection(); }
Which isn't synchronized now, but probably should be. 
{noformat}
Similarly I can't prove on paper if this should be sync-ed or not. Need more 
investigation.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-08 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475408#comment-15475408
 ] 

Michael Han commented on ZOOKEEPER-2080:


Thanks for feedback, [~fpj]!

bq. Is it the case that the {{ synchronized(self) }} block only needs to cover 
up to {{self.processReconfig}}

I think from the way the code was implemented, the block is to cover not just 
the processReconfig, but also the {{if (!rqv.equals(curQV)) {}} following, and 
the entire block has to be a single atomic operation.

bq. If so, perhaps we should end the block there and process the following 
outside. 

Are you suggesting something like 
[this|https://github.com/hanm/zookeeper/blob/60bcf70b83605e0fd574a95fd00a32e7e763f6f7/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L292]?
 Here I make the sync block only covers processReconfig. However there are a 
couple of reconfig tests failed deterministically because of such a change. I 
haven't figured out why those tests failed, but it could be the entire code 
block (processReconfig + the rest of what's the current sync covered currently) 
has to be covered such that the processReconfig and this has to be a 
transaction together:
{code}
   if (!rqv.equals(curQV)) {
LOG.info("restarting leader election");
self.shuttingDownLE = true;
self.getElectionAlg().shutdown();
break;
}
{code}

So making the sync block only covers up to processReconfig probably would not 
work ... (I need to dig deeper into the reconfig logic - probably do some data 
flow analysis to figure out the dependencies:) - for now I can't prove on paper 
it's safe (or not) as for where the sync block ends (I am using existing tests 
as the verifier).

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-09-07 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471541#comment-15471541
 ] 

Flavio Junqueira commented on ZOOKEEPER-2080:
-

[~hanm] [~shralex] [~phunt]

I gave it a look and had a look at the patch, and I'm wondering the following:

# Is it the case that the {{synchronized(self) }} block only needs to cover up 
to {{self.processReconfig}}? If so, perhaps we should end the block there and 
process the following outside. I'm saying this because of a couple of things. 
First, it is in general better to keep such synchronized blocks as short as 
possible. Second, {{shuttingDownLE}} is being used without synchronization in 
other parts of the code, so either that is broken or we don't need it in the 
block.
# Should we invert the order between {{shuttingDownLE}} and {{shutdown()}} to 
avoid a new election being created concurrently? I'd really rather avoid having 
multiple concurrent election objects to avoid confusion about connections and 
such.


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-24 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435457#comment-15435457
 ] 

Patrick Hunt commented on ZOOKEEPER-2080:
-

[~fpj] can you take a look at this one as it's changing FLE code, which you're 
probably the most familiar with.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-10 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415657#comment-15415657
 ] 

Michael Han commented on ZOOKEEPER-2080:


Failed test is TestReconfig.cc that is taking care of by ZOOKEEPER-2152.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-10 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415654#comment-15415654
 ] 

Michael Han commented on ZOOKEEPER-2080:


Yup stress tested (which is why updating the patch takes a while.).

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-10 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415646#comment-15415646
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

+1 from me, assuming you tested the patch using your script to see that it 
solves the problem. 

[~rgs] what do you think ?

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415607#comment-15415607
 ] 

Hadoop QA commented on ZOOKEEPER-2080:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12823092/ZOOKEEPER-2080.patch
  against trunk revision 1755379.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3351//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3351//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3351//console

This message is automatically generated.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-09 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414666#comment-15414666
 ] 

Michael Han commented on ZOOKEEPER-2080:


Alex, the explanation about having an atomic config + update makes sense to me. 

I only considered FastLeaderElection context but looked through code just now 
QuorumPeer::processReconfig, which changes the QuorumPeer::quorumVerifier field 
is also invoked in different context (such as Follower::processPacket, 
Learner::syncWithLeader) against same QuorumPeer which could cause problems as 
you described if we move getQuorumVerifier out of the atomic read-update block. 
I'll update the patch shortly.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-09 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414491#comment-15414491
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

I'm not sure about removing the synchonized in the last change. Configuration 
changes can occur via reconfig commands being committed, or via 
fastleaderelection messages being received. The synchronized was trying to do a 
config read + update (if needed) as an atomic step, whereas now it may be 
updating based on stale read... I'm not sure its actually a problem in this 
case, but also can't say for sure it isn't so personally wouldn't remove that 
part.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-02 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404190#comment-15404190
 ] 

Michael Han commented on ZOOKEEPER-2080:


[~rgs] I did a search on previous Jenkins builds and there were a couple where 
org.apache.zookeeper.test.QuorumTest.testMultipleWatcherObjs also failed, so 
logged ZOOKEEPER-2497 to track this. Yeah I don't think this failure is related 
to this JIRA / reconfig in general. The latest patch also passed my stress 
testing on my cluster BTW.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-01 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403431#comment-15403431
 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-2080:
---

[~hanm]: thanks for the updated patch! I think the failure with 
org.apache.zookeeper.test.QuorumTest.testMultipleWatcherObjs is unrelated, have 
you seen it fail before? Passes for me locally:

{code}
~/src/zookeeper-svn (master) ✔ function runt() { ant -Dtestcase=$1 
test-core-java; } 
~/src/zookeeper-svn (master) ✔ runt QuorumTest

()

junit.run-concurrent:
 [echo] Running 1 concurrent JUnit processes.
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/usr/share/java/ant/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:file:/usr/share/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
[junit] Running org.apache.zookeeper.test.QuorumTest
[junit] Tests run: 15, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
125.227 sec

junit.run:

test-core-java:

BUILD SUCCESSFUL
Total time: 2 minutes 52 seconds

{code}

Other than that, lgtm -- +1. 

[~shralex]: what do you think?

Happy to merge this after Alex gets another look. Thanks!

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403173#comment-15403173
 ] 

Hadoop QA commented on ZOOKEEPER-2080:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12821475/ZOOKEEPER-2080.patch
  against trunk revision 1754582.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3318//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3318//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3318//console

This message is automatically generated.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-08-01 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403146#comment-15403146
 ] 

Michael Han commented on ZOOKEEPER-2080:


[~rgs] Thanks for your feedback :)

bq. Do we really need to synchronize around self for the first part
Good point, self.configFromString does not involve state changes and for 
self.getQuorumVerifier, I can't find another place in code where a race could 
be possibly introduced that require self.getQuorumVerifier and the subsequent 
to be an atomic code block - so I moved them out of sync block and updated 
patch. CC [~shralex] just to double check.

bq. could you maybe add a comment before the synchronize(self) block 
Comments added in latest patch.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-07-31 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401425#comment-15401425
 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-2080:
---

[~hanm]: thanks for tracking this down and for the patch! A few questions/asks, 
looking at the code:

{code}
Election election = null;
synchronized(self) {
try {
rqv = self.configFromString(new String(b));
QuorumVerifier curQV = self.getQuorumVerifier();
if (rqv.getVersion() > curQV.getVersion()) {
LOG.info("{} Received version: {} my version: {}", self.getId(),
Long.toHexString(rqv.getVersion()),
Long.toHexString(self.getQuorumVerifier().getVersion()));
if (self.getPeerState() == ServerState.LOOKING) {
LOG.debug("Invoking processReconfig(), state: {}", 
self.getServerState());
self.processReconfig(rqv, null, null, false);
if (!rqv.equals(curQV)) {
LOG.info("restarting leader election");
// Signaling quorum peer to restart leader election.
self.shuttingDownLE = true;
 // Get a hold of current leader election object of quorum 
peer,
// so we can clean it up later without holding the lock of 
quorum
// peer. If we shutdown current leader election we will run 
into
// potential deadlock. See ZOOKEEPER-2080 for more details.
election = self.getElectionAlg();
}
} else {
LOG.debug("Skip processReconfig(), state: {}", 
self.getServerState());
}
}
} catch (IOException e) {
LOG.error("Something went wrong while processing config received from 
{}", response.sid);
   } catch (ConfigException e) {
   LOG.error("Something went wrong while processing config received from 
{}", response.sid);
   }
}
{code}

Do we really need to synchronize around self for the first part:

{code}
rqv = self.configFromString(new String(b));
QuorumVerifier curQV = self.getQuorumVerifier();
if (rqv.getVersion() > curQV.getVersion()) {

{code}

? Sounds like that can be done without synchronizing... no? 

Also, given you've spent a good amount of cycles untangling the dependencies 
around locking QuorumPeer, could you maybe add a comment before the 
synchronize(self) block noting why it is needed and who else might be 
contending for this lock. Thanks so much!

I think unit testing these things is a bit tricky, we might get a better return 
by just keeping better comments around synchronized regions and generally 
keeping them well maintained (imho). So I am happy to +1 without tests. 

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-07-27 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396091#comment-15396091
 ] 

Michael Han commented on ZOOKEEPER-2080:


Hi Alex, thanks for the review :)
bq. do you think that the creation of a new election object won't be interfered 
if the old object shutdown/GC hasn't happened yet 

The new leader election object and the old leader election object does not 
share object state: each object has their own QuorumCnxManager that manages the 
underlying TCP connections used for leader election. They could in theory 
possibly share the same socket address (election address), because I believe 
this address is statically generated from the connection string instead of 
dynamically generated (like the uniquePort utility we had in test), and this 
address seems to be only thing that different QuroumCnxManager shares. In 
theory we might have two QuorumCnxManager, one from old election object waiting 
to be shutdown and the other one from the new election object, that both try 
binding to same address. I haven't found any issues related this though during 
my stress test on unit tests (in particular for reconfig test), and I think we 
could possibly address this issue by some retry logic with exponential back off 
when binding to socket in QuorumCnxManager.

bq. any way to test this using a unit test
I don't have any concrete ideas around this, my thinking is we could possibly 
expose some options from related classes under test so we can artificially 
inject faults, creating race conditions and control timings. For example we 
could delay the shut down of the old leader election object and see what 
happens. As a simple test, I simply remove the statement completely and 5 out 
of 6 ReconfigRecoveryTest tests failed, which is expected because that is not 
supposed to be completely removed, so maybe instead of removing we can add a 
delay and make sure everything still works.




> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-07-26 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394813#comment-15394813
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

Hi Michael, thanks for the patch, it looks good to me. I have two questions: do 
you think that the creation of a new election object won't be
interfered if the old object if shutdown/GC hasn't happened yet ? and any way 
to test this using a unit test ? Since you're doing
stress tests I'm less worried if a unit test is complicated. You'd probably 
have to somehow suspend a messaging thread to create
the deadlock, not sure.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-07-25 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392825#comment-15392825
 ] 

Michael Han commented on ZOOKEEPER-2080:


[~shralex] Thanks for all the feedback so far:). I am trying to find a 
different solution today (my previous patch does not fix root cause, it is a 
workaround only.). I've tried another solution by [returning 
here|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L475]
 before locking on QuorumPeer if the 'shutdown' flag is set to true, and it 
does not always work (failed my pressure test). I'll try finding another 
solution.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-07-25 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392817#comment-15392817
 ] 

Michael Han commented on ZOOKEEPER-2080:


ZOOKEEPER-2488 to track this.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-07-25 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392565#comment-15392565
 ] 

Michael Han commented on ZOOKEEPER-2080:


bq. Which isn't synchronized now, but probably should be.
Yeah, this looks like a bug to me since all other accesses to the 
'shuttingDownLE' flag is synchronized except these ones:
https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1066
https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1083



> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-07-24 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391189#comment-15391189
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

On second thought, this may not work. There may be another race there with 
access to shuttingDownLE: 
Quorum peer does this:
if (shuttingDownLE) {
shuttingDownLE = false;
startLeaderElection();
}

Which isn't synchronized now, but probably should be. So looks like one has to 
set the flag and do the restart (or at least trigger it) atomically in FLE.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-07-24 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391182#comment-15391182
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

Michael, thanks for the patch! I was wondering whether moving the 
"self.getElectionAlg().shutdown();" outside of the synchronized block in FLE 
would 
also address the problem. The rest of the reconfig stuff is in the synchronized 
block because it touches the configuration state, also accessed by quorum peer. 
But
shutting down FLE doesn't seem like a shared-state operation (am I missing 
something ?). What if you move this line right after the synchronized, do 
something like if (self.shuttingDownLE = true) self.getElectionAlg().shutdown();

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-07-22 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390256#comment-15390256
 ] 

Michael Han commented on ZOOKEEPER-2080:


Friendly nudge [~shralex], [~rgs] - not sure if you guys have cycles to review 
this, but just check. Just to recap, the problem revealed by this issue is a 
real bug I think, that under certain cases a ZooKeeper server can deadlock 
during re-config. I don't have any better solution ATM other than just timeout 
on join, as indicated in this patch.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-06-22 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345538#comment-15345538
 ] 

Michael Han commented on ZOOKEEPER-2080:


Both failed tests on hudson bot might be flaky as they pass for me locally / my 
test cluster. 

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Attachments: ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-06-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345469#comment-15345469
 ] 

Hadoop QA commented on ZOOKEEPER-2080:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12812677/ZOOKEEPER-2080.patch
  against trunk revision 1748630.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3240//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3240//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3240//console

This message is automatically generated.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Attachments: ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-06-22 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345456#comment-15345456
 ] 

Michael Han commented on ZOOKEEPER-2080:


The root cause of FLE shutdown never returns is a deadlock introduced as part 
of ZOOKEEPER-107. The deadlock happens between WorkerReceiver thread of the 
Messenger in FastLeaderElection, and the Listener thread in QuorumCnxManager 
when FastLeaderElection requests restart a new leader election as part of 
dynamic reconfiguration change. An example:

# FastLeaderElection requests [restart leader 
election|https://github.com/apache/zookeeper/blob/ec056d3c3a18b862d0cd83296b7d4319652b0b1c/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L303].
 Note this block is synchronized on the QuorumPeer object self. 
# Restart leader election requires shut down existing QuorumCnxManager first, 
which requires [waiting for listener thread to finish 
execution|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L539].
# At the same time, listener threads could be in a state where it is 
[initiating new connections 
out|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L355].
# While at the previous state, listener thread could run into invocation of 
connectOne, which is [synchronized on the same QuorumPeer 
object|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L475]
 that FLE shutdown acquired earlier.
# As a result, FastLeaderElection is waiting for Listener thread to finish 
while listener thread is waiting for FLE to release the intrinsic lock on 
QuorumPeer, thus deadlock.

The code path that triggers the deadlock is introduced in ZOOKEEPER-107, so 
this issue only impacts 3.5 and not 3.4. I am attaching a patch that fixes the 
issue by specifying a timeout value when join listener thread. I am not super 
satisfied with this fix as relying on timeout is fragile, but it does fix the 
problem (validated all tests passed with my endurance test suites), and the 
side effect of bailing out seems trivial as the listener threads is going to 
die anyway and bail out does not cause leaking any resources.

I am going to dig deeper into reconfig logic see if there is way to fix the 
deadlock which is better than bail out on listener's side. Meanwhile this 
harmless patch is ready to go in if we need a quick / dirty way of fixing the 
problem. 

Also attach a thread dump that indicates the dead lock situation.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-06-16 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334720#comment-15334720
 ] 

Michael Han commented on ZOOKEEPER-2080:


I've been trying reproduce the failure cases and analyzing failed logs in last 
couple of days. After mining enough data, I am fairly confident to say that the 
culprit responsible for the sporadic failure of this test case is 
FastLeaderElection.shutdown, which never returns in the failed cases. What 
happened looks like:

* Server 3 joins ensemble and starts looking for a leader.
* Connections between server 3 and 2/1/0 were broken for some reasons (unclear 
to me, but it happen on both failed and succeeded cases.).
* Server 3 restarts leader election (happens on both failed and succeed cases.).
* The first thing when restart leader election is to shutdown the old FLE, 
where server 3 halts (when joining listener thread.) in failed cases. From this 
point, server 3 is left in a bad state and would never recover (increase 
timeout would not help). 

This also aligns with some of observations previously pointed out by Alex and 
Akihiro. Fix ZOOKEEPER-2246 might fix this as well, so I assigned that issue to 
myself. Working on a patch now (which, not might require get ZOOKEEPER-900 done 
first, we will see.).

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-06-08 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321959#comment-15321959
 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-2080:
---

[~hanm]: sure — go for it. Thanks!

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-06-08 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321770#comment-15321770
 ] 

Michael Han commented on ZOOKEEPER-2080:


I'd like to take a stab here and assign this issue to myself. Ok with you 
[~rgs]?

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-09-15 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747006#comment-14747006
 ] 

Akihiro Suda commented on ZOOKEEPER-2080:
-

We don't need to use {{java.nio.channels.SocketChannel}} (dicussed in 
[ZOOKEEPER-2246|https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14746992&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14746992]),
 but seem to need redesign of QCM (ZOOKEEPER-901).

ZOOKEEPER-2164 seems also related.

So I link ZOOKEEPER-901 and ZOOKEEPER-2164 , and promote priority from Minor to 
Major.
If this is inappropriate, please feel free to revert this change.


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-09-11 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740513#comment-14740513
 ] 

Akihiro Suda commented on ZOOKEEPER-2080:
-

Looking at JaCoCo reports, I also noticed that 
[{{QCM.SendWorker#finish()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L352-L360]
 (and hence {{QCM.RecvWorker#finish()}}) in {{QCM#receiveConnection()}} ({{sid 
< self.getId()}}) is called only on failed experiments.

When I comment out this, the bug got hard to be reproduced.

So I belive that the bug is caused by *a race condition between TCP packet 
arrivals and {{SendWorker}}/{{RecvWorker}} lifecycles*.

Especially, the socket handling in 
[{{QCM.RecvWorker#run}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L893-L926]
 is *very suspicious*, as it cannot be interrupted nor timed out.
(Should use {{java.nio.channels.SocketChannel}} rather than plain old 
{{java.net.Socket}}.)

Note that the bug also got hard to be reproduced when I comment out 
[{{Socket#setTcpNoDelay(true)}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L566]
 in {{QCM#setSockOpts()}} (as I reported on Aug 14), or use 
{{BufferedOutputStream}} instead of {{DataOutputStream}} in 
[{{QCM.SendWorker()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L719].

[~shralex], can I have your opinion on this?



> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-08-14 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698003#comment-14698003
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

https://builds.apache.org/job/ZooKeeper-trunk/2783/testReport/org.apache.zookeeper.server.quorum/ReconfigRecoveryTest/testCurrentObserverIsParticipantInNewConfig/

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-08-14 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698002#comment-14698002
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

I looked at the logs from two recent failures of this test on hudson. Here's 
what I see in both:

2015-08-13 12:11:05,321 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/127.0.0.1:24740)(secure=disabled):QuorumPeer@986] - 
LOOKING
2015-08-13 12:11:05,321 [myid:3] - INFO  
[QuorumPeer[myid=3](plain=/127.0.0.1:24740)(secure=disabled):FastLeaderElection@894]
 - New election. My id =  3, proposed zxid=0x0
2015-08-13 12:11:05,324 [myid:3] - INFO  
[QuorumPeerListener:QuorumCnxManager$Listener@631] - My election bind port: 
localhost/127.0.0.1:24739
2015-08-13 12:11:05,331 [myid:3] - INFO  
[localhost/127.0.0.1:24739:QuorumCnxManager$Listener@637] - Received connection 
request /127.0.0.1:50282
2015-08-13 12:11:05,331 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection$Messenger$WorkerReceiver@294] - 3 
Received version: 1 my version: 0
2015-08-13 12:11:05,337 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection$Messenger$WorkerReceiver@301] - 
restarting leader election
2015-08-13 12:11:05,540 [myid:3] - WARN  
[RecvWorker:0:QuorumCnxManager$RecvWorker@920] - Interrupting SendWorker
2015-08-13 12:11:05,542 [myid:3] - WARN  
[SendWorker:0:QuorumCnxManager$SendWorker@834] - Interrupted while waiting for 
message on queue
2015-08-13 12:11:05,544 [myid:3] - WARN  
[SendWorker:0:QuorumCnxManager$SendWorker@843] - Send worker leaving thread  id 
0 my id = 3
2015-08-13 12:11:06,411 [myid:3] - INFO  
[NIOServerCxnFactory.AcceptThread:localhost/127.0.0.1:24740:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /127.0.0.1:42161
2015-08-13 12:11:06,412 [myid:3] - WARN  [NIOWorkerThread-1:NIOServerCnxn@369] 
- Exception causing close of session 0x0: ZooKeeperServer not running
2015-08-13 12:11:06,412 [myid:3] - INFO  [NIOWorkerThread-1:NIOServerCnxn@606] 
- Closed socket connection for client /127.0.0.1:42161 (no se

and then the last 3 messages are basically repeated until the end of the test. 

So - after restart attempt in fast leader election, ZooKeeperServer is not 
running

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-08-13 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696525#comment-14696525
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

Thanks for continuing to work on this!

Could you please post the server logs from an execution where it fails ?

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-08-13 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696437#comment-14696437
 ] 

Akihiro Suda commented on ZOOKEEPER-2080:
-

The bug can be almost always reproduced by injecting 80 msecs delay to every 
FLE packets with my tool:
https://github.com/osrg/earthquake/tree/9078c5b039762f6c201ee036ac3453caf6168055/example/zk-repro-2080.nfqhook

When I comment out 
[{{Socket#setTcpNoDelay(true)}}|https://github.com/apache/zookeeper/blob/5b1b668d33ccf7d93c31db2a53728177393fea90/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L566]
 in {{QCnxM#setSockOpts()}}, the bug gets hard to be reproduced.

So I guess the bug is caused by a race condition in {{QCnxM}} (or in {{FLE}}).

Anyone can give us some advice about suspicious point in {{QCnxM}}?

ZOOKEEPER-2246 might be related to 2080, but just applying [this 
fix|https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14694804&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14694804]
 proposed in 2246 does not resolve 2080.


> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-06-05 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575180#comment-14575180
 ] 

Chris Nauroth commented on ZOOKEEPER-2080:
--

We've also seen some intermittent failures from the similar CppUnit tests: 
TestReconfig.cc and TestReconfigServer.cc.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-05-03 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526272#comment-14526272
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

I see. thanks for looking into this.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-05-03 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526269#comment-14526269
 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-2080:
---

Actually, it's expected since we call waitForServerUp on almost all tests, and 
that calls the 'stat' 4-letter cmd:

https://github.com/apache/zookeeper/blob/trunk/src/java/test/org/apache/zookeeper/test/ClientBase.java#L242



> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-05-03 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526266#comment-14526266
 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-2080:
---

Hmmm, interesting. Maybe there is a straggler thread generating those. I'll 
compare with a successful build. 

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-05-03 Thread Alexander Shraer (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526264#comment-14526264
 ] 

Alexander Shraer commented on ZOOKEEPER-2080:
-

I also couldn't reproduce, and we don't have access to the build machine. 
Increasing timeout sounds reasonable. 

Do you know why there are so many "Processing stat command from" messages in 
the log ? is this normal ?

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-05-03 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526245#comment-14526245
 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-2080:
---

It passes pretty consistently for me locally:

{noformat}
~/src/zookeeper-svn (ZOOKEEPER-2171) ✔ ant -Dtestcase=ReconfigRecoveryTest 
test-core-java^


BUILD SUCCESSFUL
Total time: 3 minutes 28 seconds
{noformat}

So I am tempted to propose a patch that raises the wait timeout for 'waiting 
for server 2 being up' and call it a day, unless someone has an proposal to 
refactor this test and make it more robust all together. Thoughts?

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Raul Gutierrez Segales
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2015-03-19 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369593#comment-14369593
 ] 

Ted Yu commented on ZOOKEEPER-2080:
---

Looks like the test doesn't fail recently.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2014-11-13 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209836#comment-14209836
 ] 

Flavio Junqueira commented on ZOOKEEPER-2080:
-

I think this is the same that [~phunt] observed in ZOOKEEPER-1807. I have 
reopened that one. 

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Test
>Reporter: Ted Yu
>Priority: Minor
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)