[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859061#comment-15859061 ] Hudson commented on ZOOKEEPER-2080: --- SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #3269 (See [https://builds.apache.org/job/ZooKeeper-trunk/3269/]) ZOOKEEPER-2080: Fix deadlock in dynamic reconfiguration (rakeshr: rev 434a7eef271fab02306bcc9c8ad29ec2fe2e) * (edit) src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java * (edit) src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859052#comment-15859052 ] Rakesh R commented on ZOOKEEPER-2080: - Thanks [~hanm] for the good work and others for the useful review comments. Committed to branch-3.5: https://git-wip-us.apache.org/repos/asf?p=zookeeper.git;a=commit;h=ac864f53bf7d1f3c8630348ed4cdeb2539ec3932 Committed to master: https://git-wip-us.apache.org/repos/asf?p=zookeeper.git;a=commit;h=434a7eef271fab02306bcc9c8ad29ec2fe2e > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859025#comment-15859025 ] ASF GitHub Bot commented on ZOOKEEPER-2080: --- Github user asfgit closed the pull request at: https://github.com/apache/zookeeper/pull/92 > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858792#comment-15858792 ] Alexander Shraer commented on ZOOKEEPER-2080: - This looks good, thanks Michael. I'm still +1. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858789#comment-15858789 ] Michael Han commented on ZOOKEEPER-2080: Thanks Alex. I made a small change to address your last comment: bq. Instead of passing lastSeenQV as a parameter, could you add a line here Map lastProposedView = self.getLastSeenQuorumVerifier().getAllMembers(); and the change is here fyi. https://github.com/apache/zookeeper/pull/92/commits/25f0caf15a0992b3e871f1fa5c7532f78c949037 > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858783#comment-15858783 ] Alexander Shraer commented on ZOOKEEPER-2080: - Hi Michael, Nothing changed in the code since my last comment right? so I'm still +1. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858773#comment-15858773 ] ASF GitHub Bot commented on ZOOKEEPER-2080: --- Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/92 Hey @shralex, want to close the loop here. Does the latest pull request look good to you? My stress tests on this branch has been green for past two weeks. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834049#comment-15834049 ] ASF GitHub Bot commented on ZOOKEEPER-2080: --- Github user shralex commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/92#discussion_r97266698 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java --- @@ -468,31 +469,33 @@ synchronized private boolean connectOne(long sid, InetSocketAddress electionAddr */ synchronized void connectOne(long sid){ +connectOne(sid, self.getLastSeenQuorumVerifier()); +} + +synchronized void connectOne(long sid, QuorumVerifier lastSeenQV){ if (senderWorkerMap.get(sid) != null) { - LOG.debug("There is a connection already for server " + sid); - return; +LOG.debug("There is a connection already for server " + sid); +return; } -synchronized(self) { - boolean knownId = false; -// Resolve hostname for the remote server before attempting to -// connect in case the underlying ip address has changed. -self.recreateSocketAddresses(sid); -if (self.getView().containsKey(sid)) { - knownId = true; -if (connectOne(sid, self.getView().get(sid).electionAddr)) - return; -} -if (self.getLastSeenQuorumVerifier()!=null && self.getLastSeenQuorumVerifier().getAllMembers().containsKey(sid) - && (!knownId || (self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr != - self.getView().get(sid).electionAddr))) { - knownId = true; -if (connectOne(sid, self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr)) - return; -} -if (!knownId) { -LOG.warn("Invalid server id: " + sid); +boolean knownId = false; +// Resolve hostname for the remote server before attempting to +// connect in case the underlying ip address has changed. +self.recreateSocketAddresses(sid); +if (self.getView().containsKey(sid)) { --- End diff -- How about passing also the last committed view so you don't need to call getView() multiple times ? I know you're protecting this with the lock in QuorumPeer, but before I read the other file I thought there may be a race because of multiple accesses to the config. A comment would help here. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830406#comment-15830406 ] ASF GitHub Bot commented on ZOOKEEPER-2080: --- Github user eribeiro commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/92#discussion_r96929565 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java --- @@ -434,6 +434,10 @@ public int getQuorumSize(){ //last proposed quorum verifier public QuorumVerifier lastSeenQuorumVerifier = null; +// Lock object that guard access to quorumVerifier and lastSeenQuorumVerifier. +private byte[] qvLock = new byte[0]; --- End diff -- It should be *final* if you want to use it as a lock object. Also, as constant it would be renamed as *QV_LOCK*. nit: Also, we usually use a `Object`: ``` private final Object LOCK = new Object(); ``` or a `ReentrantReadWriteLock` (more verbose tough, with try-finally, etc). > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830354#comment-15830354 ] Hadoop QA commented on ZOOKEEPER-2080: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12834574/ZOOKEEPER-2080.patch against trunk revision 42c75b5f2457f8ea5b4106ce5dc1c34c330361c0. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3563//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3563//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3563//console This message is automatically generated. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830331#comment-15830331 ] Michael Han commented on ZOOKEEPER-2080: Update... current status of this issue: * Rebased against latest master. * My stress tests on failed unit tests all pass. * Code reviewed by @lvfangmin on github. [~shralex][~fpj] Could you do a review on [Pull Request 92|https://github.com/apache/zookeeper/pull/92] when you have a chance? > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830313#comment-15830313 ] Hadoop QA commented on ZOOKEEPER-2080: -- +1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/225//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/225//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/225//console This message is automatically generated. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722843#comment-15722843 ] Hadoop QA commented on ZOOKEEPER-2080: -- -1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/100//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/100//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/100//console This message is automatically generated. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722835#comment-15722835 ] ASF GitHub Bot commented on ZOOKEEPER-2080: --- Github user hanm commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/92#discussion_r90916712 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java --- @@ -1390,24 +1406,29 @@ public QuorumVerifier configFromString(String s) throws IOException, ConfigExcep } /** - * Return QuorumVerifier object for the last committed configuration + * Return QuorumVerifier object for the last committed configuration. */ - -public synchronized QuorumVerifier getQuorumVerifier(){ -return quorumVerifier; - +public QuorumVerifier getQuorumVerifier(){ +synchronized (qvLock) { +return quorumVerifier; +} } -public synchronized QuorumVerifier getLastSeenQuorumVerifier(){ -return lastSeenQuorumVerifier; +/** + * Return QuorumVerifier object for the last proposed configuration. + */ +public QuorumVerifier getLastSeenQuorumVerifier(){ +synchronized (qvLock) { +return lastSeenQuorumVerifier; +} } -public synchronized void connectNewPeers(){ - if (qcm!=null && getQuorumVerifier()!=null && getLastSeenQuorumVerifier()!=null) { - Map committedView = getQuorumVerifier().getAllMembers(); - for (Entry e: getLastSeenQuorumVerifier().getAllMembers().entrySet()){ +private void connectNewPeers(){ --- End diff -- @lvfangmin Good suggestion, PR updated. Thanks for double checking. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15718771#comment-15718771 ] ASF GitHub Bot commented on ZOOKEEPER-2080: --- Github user lvfangmin commented on a diff in the pull request: https://github.com/apache/zookeeper/pull/92#discussion_r90764675 --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java --- @@ -1390,24 +1406,29 @@ public QuorumVerifier configFromString(String s) throws IOException, ConfigExcep } /** - * Return QuorumVerifier object for the last committed configuration + * Return QuorumVerifier object for the last committed configuration. */ - -public synchronized QuorumVerifier getQuorumVerifier(){ -return quorumVerifier; - +public QuorumVerifier getQuorumVerifier(){ +synchronized (qvLock) { +return quorumVerifier; +} } -public synchronized QuorumVerifier getLastSeenQuorumVerifier(){ -return lastSeenQuorumVerifier; +/** + * Return QuorumVerifier object for the last proposed configuration. + */ +public QuorumVerifier getLastSeenQuorumVerifier(){ +synchronized (qvLock) { +return lastSeenQuorumVerifier; +} } -public synchronized void connectNewPeers(){ - if (qcm!=null && getQuorumVerifier()!=null && getLastSeenQuorumVerifier()!=null) { - Map committedView = getQuorumVerifier().getAllMembers(); - for (Entry e: getLastSeenQuorumVerifier().getAllMembers().entrySet()){ +private void connectNewPeers(){ --- End diff -- Hi @hanm, I've followed the long discussion on the Jira, thanks for digging deeper into the problem. LGTM, only a small suggestion: it's error-prone to assume that in the future this method won't be called by other non-synchronized methods, we'd better to add the synchronize(qvLock) here too. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613237#comment-15613237 ] Ted Yu commented on ZOOKEEPER-2080: --- Thanks for the effort, Michael. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593437#comment-15593437 ] Hadoop QA commented on ZOOKEEPER-2080: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12834574/ZOOKEEPER-2080.patch against trunk revision cef5978969bedfe066f903834a9ea4af6d508844. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//console This message is automatically generated. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593405#comment-15593405 ] Michael Han commented on ZOOKEEPER-2080: I've also left some comments on PR about the refactoring. The patch is stress tested on internal Jenkins with the previous failed tests, and so far all tests passed (with 500+ runs). > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593386#comment-15593386 ] ASF GitHub Bot commented on ZOOKEEPER-2080: --- GitHub user hanm opened a pull request: https://github.com/apache/zookeeper/pull/92 ZOOKEEPER-2080: Fix deadlock in dynamic reconfiguration. Use explicit fine grained locks for synchronizing access to QuorumVerifier states in QuorumPeer. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hanm/zookeeper ZOOKEEPER-2080 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zookeeper/pull/92.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #92 commit cfb4d73a94f23c6e047f75d400372caa495f197f Author: Michael Han Date: 2016-10-20T22:37:22Z Use explicit fine grained lock for maintaining a consistent global view of QuorumVerifier in QuorumPeer. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593380#comment-15593380 ] Michael Han commented on ZOOKEEPER-2080: Getting back on this issue. I have another perspective - rather than synchronizing on entire QuorumPeer we could use fine grained locks to specifically guard the set of states affiliated with QuorumVerifier, namely: * Quorum verifier. * Last seen quorum verifier. * Some additional fields that appertain to QV and last QV, such as quorum address, election port, and client address. As for what to guard, I feel guard on individual fields (QV, last QV, quorum address, etc) is enough after examining the code paths and call graphs - if I am wrong, then we will need to figure out a compound state and invent a new class that encapsulates all these states. Also we should make sure that every access to these states will be guarded by the same lock. bq. But this update would have to synchronize on both QP and QCM, so not sure if the same problem exists as above. I feel the same way - this solution just moves the lock around =). bq. Do we need to update all view listeners in a critical section? I think yes, if we want to maintain a global consistent view of the QV states. When the value of QV in QP is updated, QP needs to fire events to notify subscribers of such change, and without locking when the event hits subscriber the view of the world might already changed. We might also need locking on the event callbacks of each listener, to avoid any potential race condition of each listener. But this case the likely hood of running into race / dead lock sounds lower as we are not synchronizing on a single QP anymore. In any cases, I feel refactoring the QuorumPeer to use fine grained locking is necessary. Attaching a patch to express my idea. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526260#comment-15526260 ] Flavio Junqueira commented on ZOOKEEPER-2080: - The only reason I had to suggest that we use this listener pattern is that we apparently need the view across a number of classes, so instead of having to have QP keeping direct references to each of these classes, we simply keep implementations of some {{ViewListener}} interface. It is unclear to me whether we need to synchronize on QP when invoking the interface call on each of the registered instances. Do we need to update all view listeners in a critical section? > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523711#comment-15523711 ] Alexander Shraer commented on ZOOKEEPER-2080: - I see. Maybe in this case Flavio's suggestion would work, or even a simpler method where you keep a copy of the views in QCM and update them from the QuorumPeer methods that set the last committed and last proposed views. But this update would have to synchronize on both QP and QCM, so not sure if the same problem exists as above. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523638#comment-15523638 ] Michael Han commented on ZOOKEEPER-2080: Hi Alex, the constraint here is that we can't invoke any of the QuroumPeer's synchronized functions inside the listener thread of QuorumCnxManager, because when we restart leader election, the same QuorumPeer object is acquired first to kick off the restart, and the lock would not be released until restart finished - and the restart would not finish until listener thread joins. Due to timing we could end up in any places inside listener's run method, so if we try to synchronize on the same QuorumPeer we would get blocked indefinitely. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523553#comment-15523553 ] Alexander Shraer commented on ZOOKEEPER-2080: - I agree about the cause, but my suggestion (above) may be easier - create one synchronized method that returns both committed and proposed configs together (though an unmodifiable collection sounds like a good idea). > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519026#comment-15519026 ] Flavio Junqueira commented on ZOOKEEPER-2080: - I suspect that {{connectOne}} needs to synchronize on self because it needs a consistent view of {{self.getView()}} and {{self.getLastSeenQuorumVerifier()}}. In fact, one interesting thing is that {{self.getView()}} is declared as: {noformat} public Map getView() { return Collections.unmodifiableMap(getQuorumVerifier().getAllMembers()); } {noformat} So all we really need is {{self.getLastSeenQuorumVerifier()}}. The root cause of all these deadlocks seems to be that we are trying to get a consistent view of the ensemble and locking {{QuorumPeer}} to guarantee consistency. The complex interdependencies across classes is making it difficult to guarantee that we don't have deadlocks. My suggestion is that we take a different approach. Each class that needs a consistent view of {{self.getLastSeenQuorumVerifier()}} will implement a listener that caches the new value locally, and {{QuorumPeer}} will broadcast changes to the quorum verifier to all listeners. Broadcasting can be done under a lock to prevent races with other operations inside {{QuorumPeer}}. I think that if we do something like this, we will be avoiding the circular dependencies and fixing the deadlocks. The change doesn't seem to be super complex, but I could be wrong, though. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517870#comment-15517870 ] Michael Han commented on ZOOKEEPER-2080: Thanks a ton for your feedback Alex, and Flavio. Had more thoughts about this today. First to reply your comments posted earlier: bq. If we pass the QV instance as part of the connectOne call, then can we remove the synchronized(self) block? To pass QV instances we have to first retrieve them from a QuorumPeer object. Get QV instances on a QuorumPeer object requires synchronization on the QuorumPeer itself, either using existing methods or using what Alex was suggesting: bq. To remove the synchronized block completely you could create a synchronized function in QuorumPeer that returns a pair So in any case we will still possibly end up with a deadlock somewhere inside listener threads - we just moved the deadlock from one place to the other :) - because somewhere in the listener thread of QuorumCnxManager we are going to invoke synchronized version of getQV, which might block because the same QuorumPeer object which the getQV methods are invoking were acquired as a lock during restart of FLE. To recap the deadlock problem - a simplified version description: * Restart leader election acquires a QuorumPeer lock. * Restart leader election requires shut down QuorumCnxManager as part of the process while holding the lock. * Shutdown QuorumCnxManager requires synchronize on the same QuorumPeer. The solution for each of these in order: * Don't acquire a QuorumPeer lock when restart LE. This is less likely a solution due to the state changes to QuorumPeer during the process required the process has to be synchronized. * Don't hold the lock while shutdown QuorumCnxManager. This is essentially the latest patch did by moving the shutdown out of the sync block. However, it does not fix the issue I spotted in previous comments (potential dead lock in QuorumPeer.restartLeaderElection - because this method synchronized implicitly on the intrinsic lock). * Don't synchronize on the QuorumPeer inside QuorumCnxManager's shutdown process: this implies we can't involve any of the QuorumPeer's synchronized methods in listener threads run loop - otherwise we will end up a deadlock. As previously commented, removing {{synchronized(self)}} block by passing QV instance to {{connectOne}} would not work because we still need to invoke the synchronized version of getQA somewhere in listener threads. Other thoughts - we could possibly interrupt the connectOne which is synchronizing on the QuorumPeer. It should be fine to interrupt because at this point everything is shutting down and the server socket was already closed (in listener.halt) so no resource leak concerns. So we just need to find a way that after we call {{listener.halt()}} in {{QuorumCnxManager.halt()}}, we can signal the listener threads to bail out the blocking state - I need to think more about how to do this because it seems not possible to interrupt a synchronized code block - so we may have to use additional abstractions (i.e. ReentrantLock?) which adds more complexity. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504624#comment-15504624 ] Flavio Junqueira commented on ZOOKEEPER-2080: - Here is a thought. {{connectOne}} is synchronized both on the {{QuorumCnxManager}} object and {{self}}, which is a {{QuorumPeer}} instance. If we pass the QV instance as part of the {{connectOne}} call, then can we remove the {{synchronized(self)}} block? I think the method is synchronized because of accesses such as the one to {{sendWorkerMap}}. cc [~hanm] [~shralex] > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495437#comment-15495437 ] Alexander Shraer commented on ZOOKEEPER-2080: - If I'm not mistaken, when the configuration changes we create a completely new QuorumVerifier object, so if you call the new function and have a reference to the old QV, the information in that object shouldn't change. But please make sure that this is the case and document this assumption if you choose to use this method. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495429#comment-15495429 ] Alexander Shraer commented on ZOOKEEPER-2080: - There is a lot of accesses to the config info from this function, and the synchronization may be needed to make sure all of them return a state which is consistent. getView returns the last committed config and getLastSeenQuorumVerifier returns the last proposed config. If we wouldn't be in a synchronized block, and the committed view changed (reconfig commits while we're in this function), it seems possible that sid moves from last proposed to last committed config and we will never try to connect to it. First, getLastSeenQuorumVerifier and getView shouldn't be accessed more than once (currently 4 and 3 times each). We can just store the result and use that. To remove the synchronized block completely you could create a synchronized function in QuorumPeer that returns a pair: last committed and last proposed view, and call that function once from connectOne. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494088#comment-15494088 ] Michael Han commented on ZOOKEEPER-2080: Regarding [~fpj]'s comment on bq. I'm wondering why QuorumCnxManager.connectOne is synchronized on self, all it is trying to do is to connect to another server. The synchronization is introduced as part of ZOOKEEPER-107 ([reference|https://github.com/apache/zookeeper/commit/675ef85ba5cd97d69b875d14854b638da6f4423c#diff-1e15b8a38445193d635ac681fe37df67R409). I think the synchronization is required because this code check and react to state of QuorumPeer thus needs to synchronize with self as a whole. If we can get ride of the synchronization here, then both this JIRA and the potential deadlock spotted at {{QuorumPeer.restartLeaderElection}} can be fixed. I am not sure if that is feasible though. [~shralex]: Could you shed some lights on the motivation of synchronizing on self in QuorumCnxManager.connectOne and also, any possibilities we can get ride of the synchronization here? > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15494073#comment-15494073 ] Michael Han commented on ZOOKEEPER-2080: bq. The server will remain in the looking state, which will force it to try again. Do you agree? Yes, the explanation makes sense to me. Previously I was not sure about this as I was concerning if skipping this will cause the following {{setCurrentVote}} fail, but now it looks like everything will converge. bq. I think this is the case I'm mentioning above, but let me know. Agreed. bq. A refactor of this code makes perfect sense to me. Out of scope of this JIRA, but notes are taken :) > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490982#comment-15490982 ] Flavio Junqueira commented on ZOOKEEPER-2080: - On the race involving {{shuttingDownLE}}, say that FLE sets it to true right after the {{QuorumPeer}} thread has checked that it is false here: {noformat} if (shuttingDownLE) { shuttingDownLE = false; startLeaderElection(); } setCurrentVote(makeLEStrategy().lookForLeader()); {noformat} The quorum peer thread will skip starting leader election, and when it calls {{lookForLeader}}, we might end up getting an empty vote because the FLE instance is stopped. This doesn't look like it is fatal, though. The server will remain in the looking state, which will force it to try again. Do you agree? bq. Do we need to synchronize on this code in QuorumPeer? I think this is the case I'm mentioning above, but let me know. bq. I think we have another potential dead lock in QuorumPeer.restartLeaderElection Sounds right, good catch. I'm wondering why {{QuorumCnxManager.connectOne}} is synchronized on self, all it is trying to do is to connect to another server. bq. While looking through code, I find it is hard to reason about current behavior due to the mixed state changes that could be done by many parties. A refactor of this code makes perfect sense to me. {{QuorumPeer}} and {{QuorumCnxManager}}} could highly benefit from a redesign. FLE could also benefit from a refactor, but I find it better encapsulated because in principle all we need is to call {{lookForLeader}} and let it send and receive messages. One thing that would be awesome to do with respect to leader election is to remove the need of an additional port. It would be great if we had a single port for all server communication. Sorry about the daydreaming, you got me going... > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489395#comment-15489395 ] Hadoop QA commented on ZOOKEEPER-2080: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12828380/ZOOKEEPER-2080.patch against trunk revision b2a484cfe743116d2531fe5d1e1d78b3960c511e. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3427//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3427//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3427//console This message is automatically generated. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15482766#comment-15482766 ] Michael Han commented on ZOOKEEPER-2080: bq. The shut down of leader election doesn't need to be in the synchronized block. This is what the patch did, by moving the shut down of LE out of the sync block. bq. it should be volatile. Yes, agreed. Will update patch. bq. is it possible that due to a race we miss that we are supposed to start LE? Possible in theory - an example: * shuttingDownLE is set to true in [FastLeaderElection|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L302] * There are a couple of other places in Leader / Follower where [processReconfig|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L302] will be called with restart LE flag set to true, which will set the shuttingDownLE flag to false. * So in theory, between we set shuttingDownLE to true in [FastLeaderElection|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L302] and we ran into check the same flag in [QuorumPeer|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1066], there could be concurrent operations which involves processReconfig that change the state of shuttingDownLE from true to false. As a result, we miss the event of the state change of shuttingDownLE in QuorumPeer's main loop. Now does it matter if we miss such event due to interleaving operations from processReconfig set shuttingDownLE from true to false? I am not sure, as I am still learning the semantic of reconfig and FLE to make a better guess here. Other notes: * Do we need to synchronize on this code in QuorumPeer? I think probably we should, because similar 'test and set' operation around shuttingDownLE in QuorumPeer (such as [restartLeaderElection|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1419] is synchronized, which makes sense (as typically in multithreading, what we need to protected is not just the signal variable but also the executions that depends on the value of the variable). * I think we have another potential dead lock in [QuorumPeer.restartLeaderElection|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1415]. Here the method is synchronized, but part of its execution requires invokes of [FastLeaderElection.shutdown|https://github.com/apache/zookeeper/blob/master/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1418], which, as previously analyzed could end up at a place in [QuorumCnxManager|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L475] where it requires obtain same QuorumPeer lock. I don't think we are shot by this yet, could either because we don't have test case cover such case, or could because that is just a theory that in practice we will never end up with such code execution flow. * While looking through code, I find it is hard to reason about current behavior due to the mixed state changes that could be done by many parties. I am thinking if there is better ways of doing this, for example using an actor based mode to restructure the code such that we have an event driven architecture where each object has a main thread which only dispatch events and entities (such as QuorumPeer and FastLeaderElection) will communicate each other by passing event messages - there would not be any shared state and state changes for each entities will then be easier to reason about. I think I will have a better evaluation after more investigation around FLE codebase. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurr
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15478360#comment-15478360 ] Flavio Junqueira commented on ZOOKEEPER-2080: - I thought a bit more about this. My understanding is that all operations related to the current quorum verifier need to be in the synchronized block. The shut down of leader election doesn't need to be in the synchronized block. I don't think "shuttingDownLE" needs to be synchronized with self, but given that it is accessed by multiple threads, it should be volatile. "shuttingDownLE" is a bit misleading because QP uses it to determine that it needs to start leader election, not that shut down is in progress. One thing that isn't entirely clear to me is whether QP can miss that "shuttingDownLE" is supposed to be true due to a race. Specifically, [https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1082 | here] is it possible that due to a race we miss that we are supposed to start LE? I think so because it doesn't seem to be synchronized. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475420#comment-15475420 ] Michael Han commented on ZOOKEEPER-2080: bq. First, it is in general better to keep such synchronized blocks as short as possible. Absolutely agree. bq. Second, shuttingDownLE is being used without synchronization in other parts of the code, so either that is broken or we don't need it in the block. This was noticed earlier as Alex commented: {noformat} Quorum peer does this: if (shuttingDownLE) { shuttingDownLE = false; startLeaderElection(); } Which isn't synchronized now, but probably should be. {noformat} Similarly I can't prove on paper if this should be sync-ed or not. Need more investigation. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475408#comment-15475408 ] Michael Han commented on ZOOKEEPER-2080: Thanks for feedback, [~fpj]! bq. Is it the case that the {{ synchronized(self) }} block only needs to cover up to {{self.processReconfig}} I think from the way the code was implemented, the block is to cover not just the processReconfig, but also the {{if (!rqv.equals(curQV)) {}} following, and the entire block has to be a single atomic operation. bq. If so, perhaps we should end the block there and process the following outside. Are you suggesting something like [this|https://github.com/hanm/zookeeper/blob/60bcf70b83605e0fd574a95fd00a32e7e763f6f7/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L292]? Here I make the sync block only covers processReconfig. However there are a couple of reconfig tests failed deterministically because of such a change. I haven't figured out why those tests failed, but it could be the entire code block (processReconfig + the rest of what's the current sync covered currently) has to be covered such that the processReconfig and this has to be a transaction together: {code} if (!rqv.equals(curQV)) { LOG.info("restarting leader election"); self.shuttingDownLE = true; self.getElectionAlg().shutdown(); break; } {code} So making the sync block only covers up to processReconfig probably would not work ... (I need to dig deeper into the reconfig logic - probably do some data flow analysis to figure out the dependencies:) - for now I can't prove on paper it's safe (or not) as for where the sync block ends (I am using existing tests as the verifier). > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471541#comment-15471541 ] Flavio Junqueira commented on ZOOKEEPER-2080: - [~hanm] [~shralex] [~phunt] I gave it a look and had a look at the patch, and I'm wondering the following: # Is it the case that the {{synchronized(self) }} block only needs to cover up to {{self.processReconfig}}? If so, perhaps we should end the block there and process the following outside. I'm saying this because of a couple of things. First, it is in general better to keep such synchronized blocks as short as possible. Second, {{shuttingDownLE}} is being used without synchronization in other parts of the code, so either that is broken or we don't need it in the block. # Should we invert the order between {{shuttingDownLE}} and {{shutdown()}} to avoid a new election being created concurrently? I'd really rather avoid having multiple concurrent election objects to avoid confusion about connections and such. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435457#comment-15435457 ] Patrick Hunt commented on ZOOKEEPER-2080: - [~fpj] can you take a look at this one as it's changing FLE code, which you're probably the most familiar with. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415657#comment-15415657 ] Michael Han commented on ZOOKEEPER-2080: Failed test is TestReconfig.cc that is taking care of by ZOOKEEPER-2152. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415654#comment-15415654 ] Michael Han commented on ZOOKEEPER-2080: Yup stress tested (which is why updating the patch takes a while.). > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415646#comment-15415646 ] Alexander Shraer commented on ZOOKEEPER-2080: - +1 from me, assuming you tested the patch using your script to see that it solves the problem. [~rgs] what do you think ? > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415607#comment-15415607 ] Hadoop QA commented on ZOOKEEPER-2080: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12823092/ZOOKEEPER-2080.patch against trunk revision 1755379. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3351//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3351//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3351//console This message is automatically generated. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414666#comment-15414666 ] Michael Han commented on ZOOKEEPER-2080: Alex, the explanation about having an atomic config + update makes sense to me. I only considered FastLeaderElection context but looked through code just now QuorumPeer::processReconfig, which changes the QuorumPeer::quorumVerifier field is also invoked in different context (such as Follower::processPacket, Learner::syncWithLeader) against same QuorumPeer which could cause problems as you described if we move getQuorumVerifier out of the atomic read-update block. I'll update the patch shortly. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414491#comment-15414491 ] Alexander Shraer commented on ZOOKEEPER-2080: - I'm not sure about removing the synchonized in the last change. Configuration changes can occur via reconfig commands being committed, or via fastleaderelection messages being received. The synchronized was trying to do a config read + update (if needed) as an atomic step, whereas now it may be updating based on stale read... I'm not sure its actually a problem in this case, but also can't say for sure it isn't so personally wouldn't remove that part. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15404190#comment-15404190 ] Michael Han commented on ZOOKEEPER-2080: [~rgs] I did a search on previous Jenkins builds and there were a couple where org.apache.zookeeper.test.QuorumTest.testMultipleWatcherObjs also failed, so logged ZOOKEEPER-2497 to track this. Yeah I don't think this failure is related to this JIRA / reconfig in general. The latest patch also passed my stress testing on my cluster BTW. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403431#comment-15403431 ] Raul Gutierrez Segales commented on ZOOKEEPER-2080: --- [~hanm]: thanks for the updated patch! I think the failure with org.apache.zookeeper.test.QuorumTest.testMultipleWatcherObjs is unrelated, have you seen it fail before? Passes for me locally: {code} ~/src/zookeeper-svn (master) ✔ function runt() { ant -Dtestcase=$1 test-core-java; } ~/src/zookeeper-svn (master) ✔ runt QuorumTest () junit.run-concurrent: [echo] Running 1 concurrent JUnit processes. [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/usr/share/java/ant/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:file:/usr/share/ant/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8 [junit] Running org.apache.zookeeper.test.QuorumTest [junit] Tests run: 15, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 125.227 sec junit.run: test-core-java: BUILD SUCCESSFUL Total time: 2 minutes 52 seconds {code} Other than that, lgtm -- +1. [~shralex]: what do you think? Happy to merge this after Alex gets another look. Thanks! > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403173#comment-15403173 ] Hadoop QA commented on ZOOKEEPER-2080: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12821475/ZOOKEEPER-2080.patch against trunk revision 1754582. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3318//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3318//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3318//console This message is automatically generated. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403146#comment-15403146 ] Michael Han commented on ZOOKEEPER-2080: [~rgs] Thanks for your feedback :) bq. Do we really need to synchronize around self for the first part Good point, self.configFromString does not involve state changes and for self.getQuorumVerifier, I can't find another place in code where a race could be possibly introduced that require self.getQuorumVerifier and the subsequent to be an atomic code block - so I moved them out of sync block and updated patch. CC [~shralex] just to double check. bq. could you maybe add a comment before the synchronize(self) block Comments added in latest patch. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log, threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401425#comment-15401425 ] Raul Gutierrez Segales commented on ZOOKEEPER-2080: --- [~hanm]: thanks for tracking this down and for the patch! A few questions/asks, looking at the code: {code} Election election = null; synchronized(self) { try { rqv = self.configFromString(new String(b)); QuorumVerifier curQV = self.getQuorumVerifier(); if (rqv.getVersion() > curQV.getVersion()) { LOG.info("{} Received version: {} my version: {}", self.getId(), Long.toHexString(rqv.getVersion()), Long.toHexString(self.getQuorumVerifier().getVersion())); if (self.getPeerState() == ServerState.LOOKING) { LOG.debug("Invoking processReconfig(), state: {}", self.getServerState()); self.processReconfig(rqv, null, null, false); if (!rqv.equals(curQV)) { LOG.info("restarting leader election"); // Signaling quorum peer to restart leader election. self.shuttingDownLE = true; // Get a hold of current leader election object of quorum peer, // so we can clean it up later without holding the lock of quorum // peer. If we shutdown current leader election we will run into // potential deadlock. See ZOOKEEPER-2080 for more details. election = self.getElectionAlg(); } } else { LOG.debug("Skip processReconfig(), state: {}", self.getServerState()); } } } catch (IOException e) { LOG.error("Something went wrong while processing config received from {}", response.sid); } catch (ConfigException e) { LOG.error("Something went wrong while processing config received from {}", response.sid); } } {code} Do we really need to synchronize around self for the first part: {code} rqv = self.configFromString(new String(b)); QuorumVerifier curQV = self.getQuorumVerifier(); if (rqv.getVersion() > curQV.getVersion()) { {code} ? Sounds like that can be done without synchronizing... no? Also, given you've spent a good amount of cycles untangling the dependencies around locking QuorumPeer, could you maybe add a comment before the synchronize(self) block noting why it is needed and who else might be contending for this lock. Thanks so much! I think unit testing these things is a bit tricky, we might get a better return by just keeping better comments around synchronized regions and generally keeping them well maintained (imho). So I am happy to +1 without tests. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396091#comment-15396091 ] Michael Han commented on ZOOKEEPER-2080: Hi Alex, thanks for the review :) bq. do you think that the creation of a new election object won't be interfered if the old object shutdown/GC hasn't happened yet The new leader election object and the old leader election object does not share object state: each object has their own QuorumCnxManager that manages the underlying TCP connections used for leader election. They could in theory possibly share the same socket address (election address), because I believe this address is statically generated from the connection string instead of dynamically generated (like the uniquePort utility we had in test), and this address seems to be only thing that different QuroumCnxManager shares. In theory we might have two QuorumCnxManager, one from old election object waiting to be shutdown and the other one from the new election object, that both try binding to same address. I haven't found any issues related this though during my stress test on unit tests (in particular for reconfig test), and I think we could possibly address this issue by some retry logic with exponential back off when binding to socket in QuorumCnxManager. bq. any way to test this using a unit test I don't have any concrete ideas around this, my thinking is we could possibly expose some options from related classes under test so we can artificially inject faults, creating race conditions and control timings. For example we could delay the shut down of the old leader election object and see what happens. As a simple test, I simply remove the statement completely and 5 out of 6 ReconfigRecoveryTest tests failed, which is expected because that is not supposed to be completely removed, so maybe instead of removing we can add a delay and make sure everything still works. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394813#comment-15394813 ] Alexander Shraer commented on ZOOKEEPER-2080: - Hi Michael, thanks for the patch, it looks good to me. I have two questions: do you think that the creation of a new election object won't be interfered if the old object if shutdown/GC hasn't happened yet ? and any way to test this using a unit test ? Since you're doing stress tests I'm less worried if a unit test is complicated. You'd probably have to somehow suspend a messaging thread to create the deadlock, not sure. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392825#comment-15392825 ] Michael Han commented on ZOOKEEPER-2080: [~shralex] Thanks for all the feedback so far:). I am trying to find a different solution today (my previous patch does not fix root cause, it is a workaround only.). I've tried another solution by [returning here|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L475] before locking on QuorumPeer if the 'shutdown' flag is set to true, and it does not always work (failed my pressure test). I'll try finding another solution. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392817#comment-15392817 ] Michael Han commented on ZOOKEEPER-2080: ZOOKEEPER-2488 to track this. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392565#comment-15392565 ] Michael Han commented on ZOOKEEPER-2080: bq. Which isn't synchronized now, but probably should be. Yeah, this looks like a bug to me since all other accesses to the 'shuttingDownLE' flag is synchronized except these ones: https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1066 https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L1083 > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391189#comment-15391189 ] Alexander Shraer commented on ZOOKEEPER-2080: - On second thought, this may not work. There may be another race there with access to shuttingDownLE: Quorum peer does this: if (shuttingDownLE) { shuttingDownLE = false; startLeaderElection(); } Which isn't synchronized now, but probably should be. So looks like one has to set the flag and do the restart (or at least trigger it) atomically in FLE. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391182#comment-15391182 ] Alexander Shraer commented on ZOOKEEPER-2080: - Michael, thanks for the patch! I was wondering whether moving the "self.getElectionAlg().shutdown();" outside of the synchronized block in FLE would also address the problem. The rest of the reconfig stuff is in the synchronized block because it touches the configuration state, also accessed by quorum peer. But shutting down FLE doesn't seem like a shared-state operation (am I missing something ?). What if you move this line right after the synchronized, do something like if (self.shuttingDownLE = true) self.getElectionAlg().shutdown(); > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390256#comment-15390256 ] Michael Han commented on ZOOKEEPER-2080: Friendly nudge [~shralex], [~rgs] - not sure if you guys have cycles to review this, but just check. Just to recap, the problem revealed by this issue is a real bug I think, that under certain cases a ZooKeeper server can deadlock during re-config. I don't have any better solution ATM other than just timeout on join, as indicated in this patch. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Fix For: 3.5.3, 3.6.0 > > Attachments: ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345538#comment-15345538 ] Michael Han commented on ZOOKEEPER-2080: Both failed tests on hudson bot might be flaky as they pass for me locally / my test cluster. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Attachments: ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345469#comment-15345469 ] Hadoop QA commented on ZOOKEEPER-2080: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12812677/ZOOKEEPER-2080.patch against trunk revision 1748630. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3240//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3240//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3240//console This message is automatically generated. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Attachments: ZOOKEEPER-2080.patch, > jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, > threaddump.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345456#comment-15345456 ] Michael Han commented on ZOOKEEPER-2080: The root cause of FLE shutdown never returns is a deadlock introduced as part of ZOOKEEPER-107. The deadlock happens between WorkerReceiver thread of the Messenger in FastLeaderElection, and the Listener thread in QuorumCnxManager when FastLeaderElection requests restart a new leader election as part of dynamic reconfiguration change. An example: # FastLeaderElection requests [restart leader election|https://github.com/apache/zookeeper/blob/ec056d3c3a18b862d0cd83296b7d4319652b0b1c/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java#L303]. Note this block is synchronized on the QuorumPeer object self. # Restart leader election requires shut down existing QuorumCnxManager first, which requires [waiting for listener thread to finish execution|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L539]. # At the same time, listener threads could be in a state where it is [initiating new connections out|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L355]. # While at the previous state, listener thread could run into invocation of connectOne, which is [synchronized on the same QuorumPeer object|https://github.com/apache/zookeeper/blob/3c37184e83a3e68b73544cebccf9388eea26f523/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L475] that FLE shutdown acquired earlier. # As a result, FastLeaderElection is waiting for Listener thread to finish while listener thread is waiting for FLE to release the intrinsic lock on QuorumPeer, thus deadlock. The code path that triggers the deadlock is introduced in ZOOKEEPER-107, so this issue only impacts 3.5 and not 3.4. I am attaching a patch that fixes the issue by specifying a timeout value when join listener thread. I am not super satisfied with this fix as relying on timeout is fragile, but it does fix the problem (validated all tests passed with my endurance test suites), and the side effect of bailing out seems trivial as the listener threads is going to die anyway and bail out does not cause leaking any resources. I am going to dig deeper into reconfig logic see if there is way to fix the deadlock which is better than bail out on listener's side. Meanwhile this harmless patch is ready to go in if we need a quick / dirty way of fixing the problem. Also attach a thread dump that indicates the dead lock situation. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334720#comment-15334720 ] Michael Han commented on ZOOKEEPER-2080: I've been trying reproduce the failure cases and analyzing failed logs in last couple of days. After mining enough data, I am fairly confident to say that the culprit responsible for the sporadic failure of this test case is FastLeaderElection.shutdown, which never returns in the failed cases. What happened looks like: * Server 3 joins ensemble and starts looking for a leader. * Connections between server 3 and 2/1/0 were broken for some reasons (unclear to me, but it happen on both failed and succeeded cases.). * Server 3 restarts leader election (happens on both failed and succeed cases.). * The first thing when restart leader election is to shutdown the old FLE, where server 3 halts (when joining listener thread.) in failed cases. From this point, server 3 is left in a bad state and would never recover (increase timeout would not help). This also aligns with some of observations previously pointed out by Alex and Akihiro. Fix ZOOKEEPER-2246 might fix this as well, so I assigned that issue to myself. Working on a patch now (which, not might require get ZOOKEEPER-900 done first, we will see.). > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Michael Han > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321959#comment-15321959 ] Raul Gutierrez Segales commented on ZOOKEEPER-2080: --- [~hanm]: sure — go for it. Thanks! > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321770#comment-15321770 ] Michael Han commented on ZOOKEEPER-2080: I'd like to take a stab here and assign this issue to myself. Ok with you [~rgs]? > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747006#comment-14747006 ] Akihiro Suda commented on ZOOKEEPER-2080: - We don't need to use {{java.nio.channels.SocketChannel}} (dicussed in [ZOOKEEPER-2246|https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14746992&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14746992]), but seem to need redesign of QCM (ZOOKEEPER-901). ZOOKEEPER-2164 seems also related. So I link ZOOKEEPER-901 and ZOOKEEPER-2164 , and promote priority from Minor to Major. If this is inappropriate, please feel free to revert this change. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740513#comment-14740513 ] Akihiro Suda commented on ZOOKEEPER-2080: - Looking at JaCoCo reports, I also noticed that [{{QCM.SendWorker#finish()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L352-L360] (and hence {{QCM.RecvWorker#finish()}}) in {{QCM#receiveConnection()}} ({{sid < self.getId()}}) is called only on failed experiments. When I comment out this, the bug got hard to be reproduced. So I belive that the bug is caused by *a race condition between TCP packet arrivals and {{SendWorker}}/{{RecvWorker}} lifecycles*. Especially, the socket handling in [{{QCM.RecvWorker#run}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L893-L926] is *very suspicious*, as it cannot be interrupted nor timed out. (Should use {{java.nio.channels.SocketChannel}} rather than plain old {{java.net.Socket}}.) Note that the bug also got hard to be reproduced when I comment out [{{Socket#setTcpNoDelay(true)}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L566] in {{QCM#setSockOpts()}} (as I reported on Aug 14), or use {{BufferedOutputStream}} instead of {{DataOutputStream}} in [{{QCM.SendWorker()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L719]. [~shralex], can I have your opinion on this? > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, > repro-20150816.log > > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698003#comment-14698003 ] Alexander Shraer commented on ZOOKEEPER-2080: - https://builds.apache.org/job/ZooKeeper-trunk/2783/testReport/org.apache.zookeeper.server.quorum/ReconfigRecoveryTest/testCurrentObserverIsParticipantInNewConfig/ > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698002#comment-14698002 ] Alexander Shraer commented on ZOOKEEPER-2080: - I looked at the logs from two recent failures of this test on hudson. Here's what I see in both: 2015-08-13 12:11:05,321 [myid:3] - INFO [QuorumPeer[myid=3](plain=/127.0.0.1:24740)(secure=disabled):QuorumPeer@986] - LOOKING 2015-08-13 12:11:05,321 [myid:3] - INFO [QuorumPeer[myid=3](plain=/127.0.0.1:24740)(secure=disabled):FastLeaderElection@894] - New election. My id = 3, proposed zxid=0x0 2015-08-13 12:11:05,324 [myid:3] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@631] - My election bind port: localhost/127.0.0.1:24739 2015-08-13 12:11:05,331 [myid:3] - INFO [localhost/127.0.0.1:24739:QuorumCnxManager$Listener@637] - Received connection request /127.0.0.1:50282 2015-08-13 12:11:05,331 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection$Messenger$WorkerReceiver@294] - 3 Received version: 1 my version: 0 2015-08-13 12:11:05,337 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection$Messenger$WorkerReceiver@301] - restarting leader election 2015-08-13 12:11:05,540 [myid:3] - WARN [RecvWorker:0:QuorumCnxManager$RecvWorker@920] - Interrupting SendWorker 2015-08-13 12:11:05,542 [myid:3] - WARN [SendWorker:0:QuorumCnxManager$SendWorker@834] - Interrupted while waiting for message on queue 2015-08-13 12:11:05,544 [myid:3] - WARN [SendWorker:0:QuorumCnxManager$SendWorker@843] - Send worker leaving thread id 0 my id = 3 2015-08-13 12:11:06,411 [myid:3] - INFO [NIOServerCxnFactory.AcceptThread:localhost/127.0.0.1:24740:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:42161 2015-08-13 12:11:06,412 [myid:3] - WARN [NIOWorkerThread-1:NIOServerCnxn@369] - Exception causing close of session 0x0: ZooKeeperServer not running 2015-08-13 12:11:06,412 [myid:3] - INFO [NIOWorkerThread-1:NIOServerCnxn@606] - Closed socket connection for client /127.0.0.1:42161 (no se and then the last 3 messages are basically repeated until the end of the test. So - after restart attempt in fast leader election, ZooKeeperServer is not running > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696525#comment-14696525 ] Alexander Shraer commented on ZOOKEEPER-2080: - Thanks for continuing to work on this! Could you please post the server logs from an execution where it fails ? > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696437#comment-14696437 ] Akihiro Suda commented on ZOOKEEPER-2080: - The bug can be almost always reproduced by injecting 80 msecs delay to every FLE packets with my tool: https://github.com/osrg/earthquake/tree/9078c5b039762f6c201ee036ac3453caf6168055/example/zk-repro-2080.nfqhook When I comment out [{{Socket#setTcpNoDelay(true)}}|https://github.com/apache/zookeeper/blob/5b1b668d33ccf7d93c31db2a53728177393fea90/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L566] in {{QCnxM#setSockOpts()}}, the bug gets hard to be reproduced. So I guess the bug is caused by a race condition in {{QCnxM}} (or in {{FLE}}). Anyone can give us some advice about suspicious point in {{QCnxM}}? ZOOKEEPER-2246 might be related to 2080, but just applying [this fix|https://issues.apache.org/jira/browse/ZOOKEEPER-2246?focusedCommentId=14694804&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14694804] proposed in 2246 does not resolve 2080. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575180#comment-14575180 ] Chris Nauroth commented on ZOOKEEPER-2080: -- We've also seen some intermittent failures from the similar CppUnit tests: TestReconfig.cc and TestReconfigServer.cc. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526272#comment-14526272 ] Alexander Shraer commented on ZOOKEEPER-2080: - I see. thanks for looking into this. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526269#comment-14526269 ] Raul Gutierrez Segales commented on ZOOKEEPER-2080: --- Actually, it's expected since we call waitForServerUp on almost all tests, and that calls the 'stat' 4-letter cmd: https://github.com/apache/zookeeper/blob/trunk/src/java/test/org/apache/zookeeper/test/ClientBase.java#L242 > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526266#comment-14526266 ] Raul Gutierrez Segales commented on ZOOKEEPER-2080: --- Hmmm, interesting. Maybe there is a straggler thread generating those. I'll compare with a successful build. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526264#comment-14526264 ] Alexander Shraer commented on ZOOKEEPER-2080: - I also couldn't reproduce, and we don't have access to the build machine. Increasing timeout sounds reasonable. Do you know why there are so many "Processing stat command from" messages in the log ? is this normal ? > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526245#comment-14526245 ] Raul Gutierrez Segales commented on ZOOKEEPER-2080: --- It passes pretty consistently for me locally: {noformat} ~/src/zookeeper-svn (ZOOKEEPER-2171) ✔ ant -Dtestcase=ReconfigRecoveryTest test-core-java^ BUILD SUCCESSFUL Total time: 3 minutes 28 seconds {noformat} So I am tempted to propose a patch that raises the wait timeout for 'waiting for server 2 being up' and call it a day, unless someone has an proposal to refactor this test and make it more robust all together. Thoughts? > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Raul Gutierrez Segales >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14369593#comment-14369593 ] Ted Yu commented on ZOOKEEPER-2080: --- Looks like the test doesn't fail recently. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Sub-task >Reporter: Ted Yu >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209836#comment-14209836 ] Flavio Junqueira commented on ZOOKEEPER-2080: - I think this is the same that [~phunt] observed in ZOOKEEPER-1807. I have reopened that one. > ReconfigRecoveryTest fails intermittently > - > > Key: ZOOKEEPER-2080 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080 > Project: ZooKeeper > Issue Type: Test >Reporter: Ted Yu >Priority: Minor > > I got the following test failure on MacBook with trunk code: > {code} > Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec > FAILED > waiting for server 2 being up > junit.framework.AssertionFailedError: waiting for server 2 being up > at > org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)