Failed: ZOOKEEPER-1140 PreCommit Build #478
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-1140
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478/

###
## LAST 60 LINES OF THE CONSOLE ##
###
[...truncated 207477 lines...]
    [exec]
    [exec] -1 overall. Here are the results of testing the latest attachment
    [exec] http://issues.apache.org/jira/secure/attachment/12490942/ZOOKEEPER-1140.patch
    [exec] against trunk revision 1163015.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
    [exec] Please justify why no new tests are needed for this patch.
    [exec] Also please list what manual steps were performed to verify this patch.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    [exec]
    [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
    [exec]
    [exec] +1 core tests. The patch passed core unit tests.
    [exec]
    [exec] +1 contrib tests. The patch passed contrib unit tests.
    [exec]
    [exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//testReport/
    [exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    [exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//console
    [exec]
    [exec] This message is automatically generated.
    [exec]
    [exec] ==
    [exec] ==
    [exec] Adding comment to Jira.
    [exec] ==
    [exec] ==
    [exec]
    [exec] Comment added.
    [exec] 5D38WioebT logged out
    [exec]
    [exec] ==
    [exec] ==
    [exec] Finished build.
    [exec] ==
    [exec] ==

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build.xml:1450: exec returned: 1

Total time: 22 minutes 31 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
Description set: ZOOKEEPER-1140
Email was triggered for: Failure
Sending email for trigger: Failure

###
## FAILED TESTS (if any) ##
All tests passed
[jira] [Commented] (ZOOKEEPER-1140) server shutdown is not stopping threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093470#comment-13093470 ]

Hadoop QA commented on ZOOKEEPER-1140:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12490942/ZOOKEEPER-1140.patch
against trunk revision 1163015.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//console

This message is automatically generated.

server shutdown is not stopping threads
---------------------------------------

Key: ZOOKEEPER-1140
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1140
Project: ZooKeeper
Issue Type: Bug
Components: server, tests
Affects Versions: 3.4.0
Reporter: Patrick Hunt
Assignee: Laxman
Priority: Blocker
Fix For: 3.4.0
Attachments: ZOOKEEPER-1140.patch

Near the end of QuorumZxidSyncTest there are tons of threads running - 115 ProcessThread threads, similar numbers of SessionTracker. Also I see ~100 ReadOnlyRequestProcessor - why is this running as a separate thread? (henry/flavio?)
[jira] [Updated] (ZOOKEEPER-1165) better eclipse support in tests
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-1165:
-------------------------------------
    Fix Version/s: (was: 3.4.0)
                   3.5.0

Not a blocker. Moving it out!

better eclipse support in tests
-------------------------------

Key: ZOOKEEPER-1165
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1165
Project: ZooKeeper
Issue Type: Bug
Components: tests
Affects Versions: 3.4.0
Environment: Eclipse
Reporter: Warren Turkal
Assignee: Warren Turkal
Priority: Minor
Labels: patch
Fix For: 3.5.0
Attachments: BaseSysTest.java.patch
Original Estimate: 1h
Remaining Estimate: 1h

The Eclipse test runner tries to run tests from all classes that inherit from TestCase. However, that base class is inherited by at least one class (org.apache.zookeeper.test.system.BaseSysTest) that has no test cases of its own, since it is used as infrastructure for other, real test cases. This patch annotates that class with @Ignore, which causes the class to be ignored. Also, because annotations are not inherited by default, this patch will not affect classes that inherit from BaseSysTest.
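For reference, a minimal sketch of the change being described; the class body is elided here, and the real change is in BaseSysTest.java.patch:

    import junit.framework.TestCase;
    import org.junit.Ignore;

    // @Ignore stops JUnit 4 runners (including Eclipse's) from trying to run
    // this infrastructure class directly. Since @Ignore is not meta-annotated
    // with @Inherited, subclasses that contain real tests still run normally.
    @Ignore
    public class BaseSysTest extends TestCase {
        // shared system-test infrastructure lives here (elided)
    }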
[jira] [Commented] (ZOOKEEPER-1136) NEW_LEADER should be queued not sent to match the Zab 1.0 protocol on the twiki
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093490#comment-13093490 ]

Mahadev konar commented on ZOOKEEPER-1136:
------------------------------------------

Ben, any update on this?

NEW_LEADER should be queued not sent to match the Zab 1.0 protocol on the twiki
--------------------------------------------------------------------------------

Key: ZOOKEEPER-1136
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1136
Project: ZooKeeper
Issue Type: Bug
Reporter: Benjamin Reed
Assignee: Benjamin Reed
Priority: Blocker
Fix For: 3.3.4, 3.4.0
Attachments: ZOOKEEPER-1136.patch, ZOOKEEPER-1136.patch

the NEW_LEADER message was sent at the beginning of the sync phase in Zab pre-1.0, but it must be at the end in Zab 1.0. if the protocol version is 1.0 or greater we need to queue rather than send the packet.
[jira] [Assigned] (ZOOKEEPER-847) Missing acl check in zookeeper create
[ https://issues.apache.org/jira/browse/ZOOKEEPER-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman reassigned ZOOKEEPER-847:
--------------------------------
    Assignee: Laxman (was: Thomas Koch)

Missing acl check in zookeeper create
-------------------------------------

Key: ZOOKEEPER-847
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-847
Project: ZooKeeper
Issue Type: Bug
Components: java client
Affects Versions: 3.3.1
Reporter: Patrick Datko
Assignee: Laxman

I looked at the source of the ZooKeeper class and noticed a missing ACL check in the asynchronous version of the create operation. Is there any reason the asynchronous version does not check whether the ACL is valid, or did someone forget to implement it? This matters to us because we are working on a refactoring of the ZooKeeper client and don't want to reproduce a bug. The following code is missing:

    if (acl != null && acl.size() == 0) {
        throw new KeeperException.InvalidACLException();
    }
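A hedged sketch of what such a check could look like if factored into a helper that both the synchronous and asynchronous create paths call; this is an illustration, not the committed patch (note that InvalidACLException is a checked KeeperException, so an async path must either declare it or report INVALIDACL through the callback):

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.data.ACL;

    final class AclValidation {
        // Mirrors the check quoted above: an explicitly empty ACL list is invalid.
        static void validateAcl(List<ACL> acl) throws KeeperException.InvalidACLException {
            if (acl != null && acl.size() == 0) {
                throw new KeeperException.InvalidACLException();
            }
        }
    }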
[jira] [Assigned] (ZOOKEEPER-851) ZK lets any node to become an observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman reassigned ZOOKEEPER-851:
--------------------------------
    Assignee: Laxman

ZK lets any node to become an observer
--------------------------------------

Key: ZOOKEEPER-851
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851
Project: ZooKeeper
Issue Type: Bug
Components: quorum, server
Affects Versions: 3.3.1
Reporter: Vishal Kher
Assignee: Laxman
Priority: Critical
Fix For: 3.5.0
Attachments: ZOOKEEPER-851.patch

I had a 3-node cluster running. The zoo.cfg on each contained the 3 server entries shown below:

    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.0=10.150.27.61:2888:3888
    server.1=10.150.27.62:2888:3888
    server.2=10.150.27.63:2888:3888

I wanted to add another node to the cluster. In the fourth node's zoo.cfg, I created another entry for that node and started the zk server. The zoo.cfg on the first 3 nodes was left unchanged. The fourth node was able to join the cluster even though the 3 nodes had no idea about the fourth node.

zoo.cfg on the fourth node:

    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.0=10.150.27.61:2888:3888
    server.1=10.150.27.62:2888:3888
    server.2=10.150.27.63:2888:3888
    server.3=10.17.117.71:2888:3888

It looks like 10.17.117.71 is becoming an observer in this case. I was expecting that the leader would reject 10.17.117.71.

    # telnet 10.17.117.71 2181
    Trying 10.17.117.71...
    Connected to 10.17.117.71.
    Escape character is '^]'.
    stat
    Zookeeper version: 3.3.0--1, built on 04/02/2010 22:40 GMT
    Clients:
     /10.17.117.71:37297[1](queued=0,recved=1,sent=0)
    Latency min/avg/max: 0/0/0
    Received: 3
    Sent: 2
    Outstanding: 0
    Zxid: 0x20065
    Mode: follower
    Node count: 288
[jira] [Updated] (ZOOKEEPER-851) ZK lets any node to become an observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated ZOOKEEPER-851:
-----------------------------
    Attachment: ZOOKEEPER-851.patch

ZK lets any node to become an observer
--------------------------------------

Key: ZOOKEEPER-851
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851
Project: ZooKeeper
Issue Type: Bug
Components: quorum, server
Affects Versions: 3.3.1
Reporter: Vishal Kher
Assignee: Laxman
Priority: Critical
Fix For: 3.5.0
Attachments: ZOOKEEPER-851.patch

I had a 3-node cluster running. The zoo.cfg on each contained the 3 server entries shown below:

    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.0=10.150.27.61:2888:3888
    server.1=10.150.27.62:2888:3888
    server.2=10.150.27.63:2888:3888

I wanted to add another node to the cluster. In the fourth node's zoo.cfg, I created another entry for that node and started the zk server. The zoo.cfg on the first 3 nodes was left unchanged. The fourth node was able to join the cluster even though the 3 nodes had no idea about the fourth node.

zoo.cfg on the fourth node:

    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.0=10.150.27.61:2888:3888
    server.1=10.150.27.62:2888:3888
    server.2=10.150.27.63:2888:3888
    server.3=10.17.117.71:2888:3888

It looks like 10.17.117.71 is becoming an observer in this case. I was expecting that the leader would reject 10.17.117.71.

    # telnet 10.17.117.71 2181
    Trying 10.17.117.71...
    Connected to 10.17.117.71.
    Escape character is '^]'.
    stat
    Zookeeper version: 3.3.0--1, built on 04/02/2010 22:40 GMT
    Clients:
     /10.17.117.71:37297[1](queued=0,recved=1,sent=0)
    Latency min/avg/max: 0/0/0
    Received: 3
    Sent: 2
    Outstanding: 0
    Zxid: 0x20065
    Mode: follower
    Node count: 288
[jira] [Updated] (ZOOKEEPER-847) Missing acl check in zookeeper create
[ https://issues.apache.org/jira/browse/ZOOKEEPER-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laxman updated ZOOKEEPER-847:
-----------------------------
    Attachment: ZOOKEEPER-847.patch

Missing acl check in zookeeper create
-------------------------------------

Key: ZOOKEEPER-847
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-847
Project: ZooKeeper
Issue Type: Bug
Components: java client
Affects Versions: 3.3.1
Reporter: Patrick Datko
Assignee: Laxman
Attachments: ZOOKEEPER-847.patch

I looked at the source of the ZooKeeper class and noticed a missing ACL check in the asynchronous version of the create operation. Is there any reason the asynchronous version does not check whether the ACL is valid, or did someone forget to implement it? This matters to us because we are working on a refactoring of the ZooKeeper client and don't want to reproduce a bug. The following code is missing:

    if (acl != null && acl.size() == 0) {
        throw new KeeperException.InvalidACLException();
    }
[jira] [Commented] (ZOOKEEPER-1140) server shutdown is not stopping threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093560#comment-13093560 ]

Laxman commented on ZOOKEEPER-1140:
-----------------------------------

Thanks for the review and commit, Mahadev.

server shutdown is not stopping threads
---------------------------------------

Key: ZOOKEEPER-1140
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1140
Project: ZooKeeper
Issue Type: Bug
Components: server, tests
Affects Versions: 3.4.0
Reporter: Patrick Hunt
Assignee: Laxman
Priority: Blocker
Fix For: 3.4.0
Attachments: ZOOKEEPER-1140.patch

Near the end of QuorumZxidSyncTest there are tons of threads running - 115 ProcessThread threads, similar numbers of SessionTracker. Also I see ~100 ReadOnlyRequestProcessor - why is this running as a separate thread? (henry/flavio?)
Failed: ZOOKEEPER-851 PreCommit Build #479
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-851
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/479/

###
## LAST 60 LINES OF THE CONSOLE ##
###
[...truncated 44938 lines...]
    [exec]
    [exec] -1 overall. Here are the results of testing the latest attachment
    [exec] http://issues.apache.org/jira/secure/attachment/12492214/ZOOKEEPER-851.patch
    [exec] against trunk revision 1163106.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
    [exec] Please justify why no new tests are needed for this patch.
    [exec] Also please list what manual steps were performed to verify this patch.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] -1 javac. The applied patch generated 11 javac compiler warnings (more than the trunk's current 10 warnings).
    [exec]
    [exec] -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.
    [exec]
    [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
    [exec]
    [exec] -1 core tests. The patch failed core unit tests.
    [exec]
    [exec] +1 contrib tests. The patch passed contrib unit tests.
    [exec]
    [exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/479//testReport/
    [exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/479//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    [exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/479//console
    [exec]
    [exec] This message is automatically generated.
    [exec]
    [exec] ==
    [exec] ==
    [exec] Adding comment to Jira.
    [exec] ==
    [exec] ==
    [exec]
    [exec] Comment added.
    [exec] I65qc2Tmvb logged out
    [exec]
    [exec] ==
    [exec] ==
    [exec] Finished build.
    [exec] ==
    [exec] ==

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build.xml:1450: exec returned: 4

Total time: 7 minutes 34 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Recording test results
Description set: ZOOKEEPER-851
Email was triggered for: Failure
Sending email for trigger: Failure

###
## FAILED TESTS (if any) ##
20 tests failed.

REGRESSION: org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testQuorum

Error Message:
Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.

Stack Trace:
junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.

REGRESSION: org.apache.zookeeper.test.AsyncHammerTest.testHammer

Error Message:
Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.

Stack Trace:
junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.

REGRESSION: org.apache.zookeeper.test.AsyncTest.testAsync

Error Message:
Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.

Stack Trace:
junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.

REGRESSION: org.apache.zookeeper.test.CnxManagerTest.testWorkerThreads

Error Message:
Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.

Stack Trace:
junit.framework.AssertionFailedError: Forked Java VM exited abnormally. Please note the time in the report does not reflect the time until the VM exit.

REGRESSION:
[jira] [Commented] (ZOOKEEPER-851) ZK lets any node to become an observer
[ https://issues.apache.org/jira/browse/ZOOKEEPER-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093564#comment-13093564 ]

Hadoop QA commented on ZOOKEEPER-851:
-------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12492214/ZOOKEEPER-851.patch
against trunk revision 1163106.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

-1 javac. The applied patch generated 11 javac compiler warnings (more than the trunk's current 10 warnings).

-1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/479//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/479//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/479//console

This message is automatically generated.

ZK lets any node to become an observer
--------------------------------------

Key: ZOOKEEPER-851
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-851
Project: ZooKeeper
Issue Type: Bug
Components: quorum, server
Affects Versions: 3.3.1
Reporter: Vishal Kher
Assignee: Laxman
Priority: Critical
Fix For: 3.4.0, 3.5.0
Attachments: ZOOKEEPER-851.patch

I had a 3-node cluster running. The zoo.cfg on each contained the 3 server entries shown below:

    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.0=10.150.27.61:2888:3888
    server.1=10.150.27.62:2888:3888
    server.2=10.150.27.63:2888:3888

I wanted to add another node to the cluster. In the fourth node's zoo.cfg, I created another entry for that node and started the zk server. The zoo.cfg on the first 3 nodes was left unchanged. The fourth node was able to join the cluster even though the 3 nodes had no idea about the fourth node.

zoo.cfg on the fourth node:

    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181
    initLimit=5
    syncLimit=2
    server.0=10.150.27.61:2888:3888
    server.1=10.150.27.62:2888:3888
    server.2=10.150.27.63:2888:3888
    server.3=10.17.117.71:2888:3888

It looks like 10.17.117.71 is becoming an observer in this case. I was expecting that the leader would reject 10.17.117.71.

    # telnet 10.17.117.71 2181
    Trying 10.17.117.71...
    Connected to 10.17.117.71.
    Escape character is '^]'.
    stat
    Zookeeper version: 3.3.0--1, built on 04/02/2010 22:40 GMT
    Clients:
     /10.17.117.71:37297[1](queued=0,recved=1,sent=0)
    Latency min/avg/max: 0/0/0
    Received: 3
    Sent: 2
    Outstanding: 0
    Zxid: 0x20065
    Mode: follower
    Node count: 288
Success: ZOOKEEPER-847 PreCommit Build #480
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-847
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/480/

###
## LAST 60 LINES OF THE CONSOLE ##
###
[...truncated 208507 lines...]
    [exec] BUILD SUCCESSFUL
    [exec] Total time: 0 seconds
    [exec]
    [exec]
    [exec] +1 overall. Here are the results of testing the latest attachment
    [exec] http://issues.apache.org/jira/secure/attachment/12492215/ZOOKEEPER-847.patch
    [exec] against trunk revision 1163106.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 12 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    [exec]
    [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
    [exec]
    [exec] +1 core tests. The patch passed core unit tests.
    [exec]
    [exec] +1 contrib tests. The patch passed contrib unit tests.
    [exec]
    [exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/480//testReport/
    [exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/480//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    [exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/480//console
    [exec]
    [exec] This message is automatically generated.
    [exec]
    [exec] ==
    [exec] ==
    [exec] Adding comment to Jira.
    [exec] ==
    [exec] ==
    [exec]
    [exec] Comment added.
    [exec] 5S87Zk49Dl logged out
    [exec]
    [exec] ==
    [exec] ==
    [exec] Finished build.
    [exec] ==
    [exec] ==

BUILD SUCCESSFUL
Total time: 24 minutes 47 seconds
Archiving artifacts
Recording test results
Description set: ZOOKEEPER-847
Email was triggered for: Success
Sending email for trigger: Success

###
## FAILED TESTS (if any) ##
All tests passed
[jira] [Commented] (ZOOKEEPER-847) Missing acl check in zookeeper create
[ https://issues.apache.org/jira/browse/ZOOKEEPER-847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093568#comment-13093568 ]

Hadoop QA commented on ZOOKEEPER-847:
-------------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12492215/ZOOKEEPER-847.patch
against trunk revision 1163106.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 12 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/480//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/480//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/480//console

This message is automatically generated.

Missing acl check in zookeeper create
-------------------------------------

Key: ZOOKEEPER-847
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-847
Project: ZooKeeper
Issue Type: Bug
Components: java client
Affects Versions: 3.3.1, 3.3.2, 3.3.3
Reporter: Patrick Datko
Assignee: Laxman
Fix For: 3.4.0
Attachments: ZOOKEEPER-847.patch

I looked at the source of the ZooKeeper class and noticed a missing ACL check in the asynchronous version of the create operation. Is there any reason the asynchronous version does not check whether the ACL is valid, or did someone forget to implement it? This matters to us because we are working on a refactoring of the ZooKeeper client and don't want to reproduce a bug. The following code is missing:

    if (acl != null && acl.size() == 0) {
        throw new KeeperException.InvalidACLException();
    }
[jira] [Commented] (ZOOKEEPER-1140) server shutdown is not stopping threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093647#comment-13093647 ]

Hudson commented on ZOOKEEPER-1140:
-----------------------------------

Integrated in ZooKeeper-trunk #1288 (See [https://builds.apache.org/job/ZooKeeper-trunk/1288/])
ZOOKEEPER-1140. server shutdown is not stopping threads. (laxman via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163102
Files :
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ReadOnlyZooKeeperServer.java

server shutdown is not stopping threads
---------------------------------------

Key: ZOOKEEPER-1140
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1140
Project: ZooKeeper
Issue Type: Bug
Components: server, tests
Affects Versions: 3.4.0
Reporter: Patrick Hunt
Assignee: Laxman
Priority: Blocker
Fix For: 3.4.0
Attachments: ZOOKEEPER-1140.patch

Near the end of QuorumZxidSyncTest there are tons of threads running - 115 ProcessThread threads, similar numbers of SessionTracker. Also I see ~100 ReadOnlyRequestProcessor - why is this running as a separate thread? (henry/flavio?)
[jira] [Commented] (ZOOKEEPER-1051) SIGPIPE in Zookeeper 0.3.* when send'ing after cluster disconnection
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093644#comment-13093644 ]

Hudson commented on ZOOKEEPER-1051:
-----------------------------------

Integrated in ZooKeeper-trunk #1288 (See [https://builds.apache.org/job/ZooKeeper-trunk/1288/])
ZOOKEEPER-1051. SIGPIPE in Zookeeper 0.3.* when send'ing after cluster disconnection (Stephen Tyree via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163106
Files :
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/c/src/zookeeper.c

SIGPIPE in Zookeeper 0.3.* when send'ing after cluster disconnection
--------------------------------------------------------------------

Key: ZOOKEEPER-1051
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1051
Project: ZooKeeper
Issue Type: Bug
Components: c client
Affects Versions: 3.3.2, 3.3.3, 3.4.0
Reporter: Stephen Tyree
Assignee: Stephen Tyree
Priority: Minor
Fix For: 3.4.0
Attachments: ZOOKEEPER-1051.patch, ZOOKEEPER-1051.patch
Original Estimate: 2h
Remaining Estimate: 2h

In libzookeeper_mt, if your process is running rather slowly (such as under Valgrind's Memcheck) or you are using gdb with breakpoints, you can occasionally get SIGPIPE when trying to send a message to the cluster. For example:

    ==12788==
    ==12788== Process terminating with default action of signal 13 (SIGPIPE)
    ==12788==    at 0x3F5180DE91: send (in /lib64/libpthread-2.5.so)
    ==12788==    by 0x7F060AA: ??? (in /usr/lib64/libzookeeper_mt.so.2.0.0)
    ==12788==    by 0x7F06E5B: zookeeper_process (in /usr/lib64/libzookeeper_mt.so.2.0.0)
    ==12788==    by 0x7F0D38E: ??? (in /usr/lib64/libzookeeper_mt.so.2.0.0)
    ==12788==    by 0x3F5180673C: start_thread (in /lib64/libpthread-2.5.so)
    ==12788==    by 0x3F50CD3F6C: clone (in /lib64/libc-2.5.so)
    ==12788==

This is probably not the behavior we would like, since we handle server disconnections after a failed call to send. To fix this, there are a few options we could use. For BSD environments, we can tell a socket to never raise SIGPIPE from send using setsockopt:

    int set = 1;
    setsockopt(sd, SOL_SOCKET, SO_NOSIGPIPE, (void *)&set, sizeof(int));

For Linux environments, we can add a MSG_NOSIGNAL flag to every send call that says not to raise SIGPIPE when the peer has closed the connection. For more information, see:
http://stackoverflow.com/questions/108183/how-to-prevent-sigpipes-or-handle-them-properly
[jira] [Commented] (ZOOKEEPER-999) Create an package integration project
[ https://issues.apache.org/jira/browse/ZOOKEEPER-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093645#comment-13093645 ]

Hudson commented on ZOOKEEPER-999:
----------------------------------

Integrated in ZooKeeper-trunk #1288 (See [https://builds.apache.org/job/ZooKeeper-trunk/1288/])
ZOOKEEPER-999. Create an package integration project (Eric Yang via phunt)

phunt : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163015
Files :
* /zookeeper/trunk/CHANGES.txt

Create an package integration project
--------------------------------------

Key: ZOOKEEPER-999
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-999
Project: ZooKeeper
Issue Type: New Feature
Components: build
Environment: Java 6, RHEL/Ubuntu
Reporter: Eric Yang
Assignee: Eric Yang
Fix For: 3.4.0
Attachments: ZOOKEEPER-999-1.patch, ZOOKEEPER-999-10.patch, ZOOKEEPER-999-11.patch, ZOOKEEPER-999-12.patch, ZOOKEEPER-999-13.patch, ZOOKEEPER-999-2.patch, ZOOKEEPER-999-3.patch, ZOOKEEPER-999-4.patch, ZOOKEEPER-999-5.patch, ZOOKEEPER-999-6.patch, ZOOKEEPER-999-7.patch, ZOOKEEPER-999-8.patch, ZOOKEEPER-999-9.patch, ZOOKEEPER-999.patch

The goal of this ticket is to generate a set of RPM/Debian packages which integrate well with the RPM set created by HADOOP-6255.
[jira] [Commented] (ZOOKEEPER-1153) Deprecate AuthFLE and LE
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093646#comment-13093646 ]

Hudson commented on ZOOKEEPER-1153:
-----------------------------------

Integrated in ZooKeeper-trunk #1288 (See [https://builds.apache.org/job/ZooKeeper-trunk/1288/])
ZOOKEEPER-1153. Deprecate AuthFLE and LE. (Flavio Junqueira via mahadev)

mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163099
Files :
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/docs/bookkeeperConfig.pdf
* /zookeeper/trunk/docs/bookkeeperOverview.pdf
* /zookeeper/trunk/docs/bookkeeperProgrammer.pdf
* /zookeeper/trunk/docs/bookkeeperStarted.pdf
* /zookeeper/trunk/docs/bookkeeperStream.pdf
* /zookeeper/trunk/docs/index.pdf
* /zookeeper/trunk/docs/javaExample.pdf
* /zookeeper/trunk/docs/linkmap.pdf
* /zookeeper/trunk/docs/recipes.pdf
* /zookeeper/trunk/docs/releasenotes.pdf
* /zookeeper/trunk/docs/zookeeperAdmin.html
* /zookeeper/trunk/docs/zookeeperAdmin.pdf
* /zookeeper/trunk/docs/zookeeperHierarchicalQuorums.pdf
* /zookeeper/trunk/docs/zookeeperInternals.pdf
* /zookeeper/trunk/docs/zookeeperJMX.pdf
* /zookeeper/trunk/docs/zookeeperObservers.pdf
* /zookeeper/trunk/docs/zookeeperOver.pdf
* /zookeeper/trunk/docs/zookeeperProgrammers.pdf
* /zookeeper/trunk/docs/zookeeperQuotas.pdf
* /zookeeper/trunk/docs/zookeeperStarted.pdf
* /zookeeper/trunk/docs/zookeeperTutorial.pdf
* /zookeeper/trunk/src/docs/src/documentation/content/xdocs/zookeeperAdmin.xml
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/AuthFastLeaderElection.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LeaderElection.java
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java

Deprecate AuthFLE and LE
------------------------

Key: ZOOKEEPER-1153
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1153
Project: ZooKeeper
Issue Type: Improvement
Affects Versions: 3.3.3
Reporter: Flavio Junqueira
Assignee: Flavio Junqueira
Fix For: 3.4.0
Attachments: ZOOKEEPER-1153.patch, ZOOKEEPER-1153.patch

I propose we mark these as deprecated in 3.4.0 and remove them in the following release.
RE: How zab avoid split-brain problem?
Hi Peter,

It's the second option. The servers don't know whether the leader failed or was partitioned from them, so each group of 3 servers in your scenario can't distinguish the situation from another scenario in which none of the servers failed but these 3 servers are partitioned from the other 4. To prevent a split brain in an asynchronous network, a leader must have the support of a quorum.

Alex

-----Original Message-----
From: cheetah [mailto:xuw...@gmail.com]
Sent: Tuesday, August 30, 2011 12:23 AM
To: dev@zookeeper.apache.org
Subject: How zab avoid split-brain problem?

Hi folks,

I am reading the Zab paper, but I am a bit confused about how Zab handles the split-brain problem. Suppose there are seven servers, A, B, C, D, E, F, and G, and A is the leader. A dies, and at the same time B, C, D are isolated from E, F, and G. Will Zab continue working like this: the two groups, B, C, D and E, F, G, both vote and elect B and E as their leaders separately, so that there is a split-brain problem? Or does ZooKeeper just stop working, because there were originally 7 servers, and after 1 failure a new leader still expects to have a quorum of servers voting for it, and since the two groups are separate from each other, no leader can be elected?

If it is the first case, ZooKeeper has a split-brain problem, which probably is not so. But in the second case, a 7-node ZooKeeper service can only handle one node failure combined with one such network partition. Am I understanding this wrongly? Looking forward to your insights.

Thanks,
Peter
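To make Alex's point concrete, here is a toy illustration (not ZooKeeper code) of why two disjoint groups can never both hold a quorum:

    // A quorum is a strict majority of the ensemble, so two disjoint groups
    // can never both reach it: their sizes would have to sum to more than n.
    final class QuorumMath {
        static boolean hasQuorum(int ensembleSize, int groupSize) {
            return groupSize > ensembleSize / 2;
        }

        public static void main(String[] args) {
            int n = 7; // Peter's scenario: A..G, A dies, B,C,D split from E,F,G
            System.out.println(hasQuorum(n, 3)); // false: B,C,D cannot elect a leader
            System.out.println(hasQuorum(n, 3)); // false: neither can E,F,G
            System.out.println(hasQuorum(n, 4)); // true: any 4 connected servers could
        }
    }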
[jira] [Commented] (ZOOKEEPER-706) large numbers of watches can cause session re-establishment to fail
[ https://issues.apache.org/jira/browse/ZOOKEEPER-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094136#comment-13094136 ]

Eric Hwang commented on ZOOKEEPER-706:
--------------------------------------

Any idea if the jute.maxbuffer setting needs to be applied to both server and client? Or just the client?

large numbers of watches can cause session re-establishment to fail
-------------------------------------------------------------------

Key: ZOOKEEPER-706
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-706
Project: ZooKeeper
Issue Type: Bug
Components: c client, java client
Affects Versions: 3.1.2, 3.2.2, 3.3.0
Reporter: Patrick Hunt
Priority: Critical
Fix For: 3.5.0

If a client sets a large number of watches, the "set watches" operation during session re-establishment can fail. For example:

    WARN [NIOServerCxn.Factory:22801:NIOServerCnxn@417] - Exception causing close of session 0xe727001201a4ee7c due to java.io.IOException: Len error 4348380

In this case the client was a web monitoring app and had set both data and child watches on 32k znodes.

There are two issues I see here that we need to fix:
1) handle this case properly (split up the set watches into multiple calls I guess...)
2) the session should have expired after the timeout. However, we seem to consider any message from the client as resetting the expiration on the server side. Probably we should only consider messages from the client that are sent during an established session; otherwise we can see this situation where the session is not established but the session is not expired either. Perhaps we should create another JIRA for this particular issue.
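For anyone hitting the "Len error" above: jute.maxbuffer is a Java system property, and since both client and server enforce the limit on incoming buffers, it generally has to be raised on every JVM involved. A hedged example with hypothetical values (the default is just under 1 MB; com.example.MyZkClient is a placeholder for your client's main class):

    # hypothetical 4 MB limit; set the same value on all servers and clients
    java -Djute.maxbuffer=4194304 -cp zookeeper.jar:... org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg
    java -Djute.maxbuffer=4194304 -cp myapp.jar:... com.example.MyZkClient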
Re: How zab avoid split-brain problem?
Hi Alex,

Thanks for the explanation. Then I have another question: if there are 7 machines in my current ZooKeeper cluster and two of them have failed, how can I reconfigure ZooKeeper to make it work with 5 machines, i.e. so that the master can commit a transaction once it gets replies from 3 machines? On the other hand, if I add 2 machines to make a 9-node ZooKeeper cluster, how can I configure it to take advantage of all 9 machines?

This is more relevant to the user mailing list, so I have cc'd it.

Thanks,
Peter
RE: How zab avoid split-brain problem?
Hi Peter,

We're currently working on adding dynamic reconfiguration functionality to ZooKeeper; I hope it will get into the next release of ZK (after 3.4). With this you'll just run a new zk command to add/remove any servers, change ports, change roles (followers/observers), etc.

Currently, membership is determined by the config file, so the only way of doing this is a rolling restart. This means that you change the configuration files and bounce the servers back. You should do it in a way that guarantees that, at any time, any quorum of the servers that are up intersects with any quorum of the old configuration (otherwise you might lose data). For example, if you're going from (A, B, C) to (A, B, C, D, E), it is possible that A and B have the latest data whereas C is falling behind (ZK stores data on a quorum), so if you just change the config files of A, B, C to say that they are part of the larger configuration, then C might be elected with the support of D and E and you might lose data. So in this case you'll have to first add D, and only later add E; this way the quorums intersect. The same applies when removing servers.

Alex
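A toy brute-force check (not ZooKeeper code) of the invariant Alex describes, namely that every majority of the new membership must share at least one server with every majority of the old one. It confirms that jumping from (A, B, C) straight to five servers violates the invariant, while adding one server first preserves it:

    import java.util.*;

    final class QuorumIntersect {
        // True if every strict majority of 'a' intersects every strict majority of 'b'.
        static boolean quorumsIntersect(Set<String> a, Set<String> b) {
            for (Set<String> qa : minimalMajorities(a)) {
                for (Set<String> qb : minimalMajorities(b)) {
                    Set<String> common = new HashSet<>(qa);
                    common.retainAll(qb);
                    if (common.isEmpty()) return false;
                }
            }
            return true;
        }

        // Enumerate all subsets of the minimal majority size; larger majorities
        // contain minimal ones, so checking these suffices.
        static List<Set<String>> minimalMajorities(Set<String> servers) {
            List<String> list = new ArrayList<>(servers);
            int need = servers.size() / 2 + 1;
            List<Set<String>> out = new ArrayList<>();
            for (int mask = 0; mask < (1 << list.size()); mask++) {
                if (Integer.bitCount(mask) != need) continue;
                Set<String> q = new HashSet<>();
                for (int i = 0; i < list.size(); i++) {
                    if ((mask & (1 << i)) != 0) q.add(list.get(i));
                }
                out.add(q);
            }
            return out;
        }

        public static void main(String[] args) {
            Set<String> old3 = Set.of("A", "B", "C");
            Set<String> new5 = Set.of("A", "B", "C", "D", "E"); // add D and E at once
            Set<String> new4 = Set.of("A", "B", "C", "D");      // add D first
            System.out.println(quorumsIntersect(old3, new5)); // false: {C,D,E} misses {A,B}
            System.out.println(quorumsIntersect(old3, new4)); // true: safe first step
        }
    }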
Re: NodeExistsException when creating a znode with sequential and ephemeral mode
Camille,

We applied the patch (ZOOKEEPER-1046-for333) to our SUT. There was no error.

Thanks,
alex

On Aug 30, 2011, 11:19 AM, 박영근(Alex) alex.p...@nexr.com wrote:

We used 3.3.3. We will check out the latest code. Thanks Camille.

Alex

2011/8/30 Camille Fournier cami...@apache.org

More specifically, we fixed this for the upcoming release:
https://issues.apache.org/jira/browse/ZOOKEEPER-1046
You can try checking out the latest code and building it; that should fix your error. I believe 3.3.4 will be released in a week or two.

c

On Mon, Aug 29, 2011 at 9:56 PM, Camille Fournier cami...@apache.org wrote:

What version of ZK were you using?

On Mon, Aug 29, 2011 at 9:50 PM, 박영근(Alex) alex.p...@nexr.com wrote:

Hi, all

I ran into a NodeExistsException when creating a znode in sequential and ephemeral mode. The total number of test operations was 6442314, and 797 errors occurred. The related log message is as follows:

    2011-08-27 16:26:17,559 - INFO [ProcessThread:-1:PrepRequestProcessor@407][] - Got user-level KeeperException when processing sessionid:0x2320911802a0002 type:create cxid:0x1246d7 zxid:0xfffe txntype:unknown reqpath:n/a Error Path:/NexR/MasteElection/__rwLock/readLock-lssm07-0005967078 Error:KeeperErrorCode = NodeExists for /NexR/MasteElection/__rwLock/readLock-lssm07-0005967078

The sequence number is created by incrementing the parent's cversion in PrepRequestProcessor, so I guess this problem was caused by an inconsistency in the parent znode. Our test scenario is very aggressive:

- The Grinder agent sends requests to create znodes with CreateMode.EPHEMERAL_SEQUENTIAL.
- Three servers compose the ensemble.
- The NIC of each server goes down and up repeatedly:
  - the NIC of server1 goes down every minute, stays down for 9 seconds, then comes back up
  - the NIC of server2 goes down every 2 minutes, stays down for 9 seconds, then comes back up
  - the NIC of server3 goes down every 3 minutes, stays down for 9 seconds, then comes back up

While the probability of error occurrence is only about 0.0001, as computed above, if ZooKeeper cannot guarantee consistency it is fatal. Is there any idea or related issue?

thanks in advance.
alex.
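For context, a simplified sketch of how the server builds a sequential znode name, modeled on PrepRequestProcessor: the 10-digit suffix is the parent's cversion, zero-padded. A duplicate name like readLock-lssm07-0005967078 therefore means the parent's cversion was reused, which is exactly the inconsistency ZOOKEEPER-1046 addresses:

    final class SequencePath {
        // A sequential create appends the parent's cversion, padded to 10 digits.
        static String appendSequenceSuffix(String clientPath, int parentCVersion) {
            return clientPath + String.format("%010d", parentCVersion);
        }
        // e.g. appendSequenceSuffix("/NexR/MasteElection/__rwLock/readLock-lssm07-", 5967078)
        //      -> "/NexR/MasteElection/__rwLock/readLock-lssm07-0005967078"
    }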
[jira] [Created] (BOOKKEEPER-61) BufferedChannel read endless when the remaining bytes of file is less than the capacity of read buffer
BufferedChannel read endless when the remaining bytes of file is less than the capacity of read buffer
-------------------------------------------------------------------------------------------------------

Key: BOOKKEEPER-61
URL: https://issues.apache.org/jira/browse/BOOKKEEPER-61
Project: Bookkeeper
Issue Type: Bug
Components: bookkeeper-server
Affects Versions: 3.4.0
Reporter: Sijie Guo

If the last record in an entry log file is truncated (the length of the data is shorter than the expected length), the bookie goes into an infinite loop reading this record. A truncated record can be caused in the following cases:
1) the bookie server is killed during a bookie restart while replaying logs.
2) the bookie server is killed while the bookie is performing an add-entry operation.
[jira] [Updated] (BOOKKEEPER-61) BufferedChannel read endless when the remaining bytes of file is less than the capacity of read buffer
[ https://issues.apache.org/jira/browse/BOOKKEEPER-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sijie Guo updated BOOKKEEPER-61:
--------------------------------
    Attachment: bookkeeper-61.patch

Return the number of bytes that have been read when the end of the file is reached.

BufferedChannel read endless when the remaining bytes of file is less than the capacity of read buffer
-------------------------------------------------------------------------------------------------------

Key: BOOKKEEPER-61
URL: https://issues.apache.org/jira/browse/BOOKKEEPER-61
Project: Bookkeeper
Issue Type: Bug
Components: bookkeeper-server
Affects Versions: 3.4.0
Reporter: Sijie Guo
Attachments: bookkeeper-61.patch

If the last record in an entry log file is truncated (the length of the data is shorter than the expected length), the bookie goes into an infinite loop reading this record. A truncated record can be caused in the following cases:
1) the bookie server is killed during a bookie restart while replaying logs.
2) the bookie server is killed while the bookie is performing an add-entry operation.
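The shape of the fix the attachment describes, as a hedged sketch rather than BookKeeper's actual code: a positional read loop must treat end-of-file as terminal and report a short read, instead of retrying forever when fewer bytes remain than the buffer can hold:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    final class SafeRead {
        // Returns the bytes actually read, or -1 if the position is at EOF.
        // Reporting the short count lets the caller detect a truncated final
        // record rather than spinning on a read that can never complete.
        static int read(FileChannel ch, ByteBuffer dst, long pos) throws IOException {
            int total = 0;
            while (dst.hasRemaining()) {
                int n = ch.read(dst, pos + total);
                if (n < 0) {
                    return total > 0 ? total : -1; // end of file: stop, don't loop
                }
                total += n;
            }
            return total;
        }
    }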
[jira] [Commented] (BOOKKEEPER-56) Race condition of message handler in connection recovery in Hedwig client
[ https://issues.apache.org/jira/browse/BOOKKEEPER-56?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093662#comment-13093662 ]

Ivan Kelly commented on BOOKKEEPER-56:
--------------------------------------

It seems inelegant to have to look up the delivery handler every time, when the message has already arrived in an object which can know how to deliver it. Perhaps we could add a package-private method on HedwigSubscriber, called restartDelivery, which gets the handler from the hashmap and sets it in the response handler. In this case, the patch wouldn't modify the response handler at all, just how the reconnect callback sets it.

The correct behaviour in this case is that the reconnect callback should not be able to overwrite the message handler. I think it is also valid to broaden this to say that no one should ever be able to overwrite the message handler, as this would indicate that startDelivery had been called twice without stopDelivery being called in between, which would indicate a programming error on the part of the client.

There are tabs in the patch. For BK/HW the standard is 4-space indentation.

Race condition of message handler in connection recovery in Hedwig client
--------------------------------------------------------------------------

Key: BOOKKEEPER-56
URL: https://issues.apache.org/jira/browse/BOOKKEEPER-56
Project: Bookkeeper
Issue Type: Bug
Components: hedwig-client
Affects Versions: 3.4.0
Reporter: Gavin Li
Fix For: 3.4.0
Attachments: patch_56

There's a race condition in the connection recovery logic in the Hedwig client: the message handler the user set might be overwritten incorrectly. When handling a channelDisconnected event, we try to reconnect to the Hedwig server. After the connection is created and subscribed, we call StartDelivery() to restore the message handler of the original, disconnected connection. But if during this process the user calls StartDelivery() to set a new message handler, it will get overwritten by the original one. The process can be demonstrated as below (time flows downward):

    main thread                       netty worker thread
    -----------                       -------------------
    StartDelivery(messageHandlerA)
    (connection broken here, and
    recovered later...)
                                      ResponseHandler::channelDisconnected()
                                      (connection disconnected event received)
                                      new SubscribeReconnectCallback(subHandler.getMessageHandler())
                                      (store messageHandlerA in SubscribeReconnectCallback to recover later)
                                      client.doConnect() (try reconnect)
                                      doSubUnsub() (resubscribe)
                                      SubscriberResponseHandler::handleSubscribeResponse()
                                      (subscription succeeds)
    StartDelivery(messageHandlerB)
                                      SubscribeReconnectCallback::operationFinished()
                                      StartDelivery(messageHandlerA)
                                      (messageHandler gets overwritten)

I can stably reproduce this by simulating the race condition with some sleeps in ResponseHandler.

I think, essentially speaking, we should not store the messageHandler in the ResponseHandler, since that binds the message handler to a connection. Instead, no matter which connection is in use, we should use the same messageHandler, the one the user set last. So I think we should store the messageHandler in the HedwigSubscriber; that way we don't need to recover the handler during connection recovery and thus won't face this race condition.
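A minimal sketch of the approach Ivan suggests, with hypothetical stand-ins for the Hedwig types (MessageHandler, DeliveryChannel, and the topic keying are placeholders, and channel registration is elided): the subscriber owns the current handler, and the reconnect path calls a package-private restartDelivery that can only reinstall the latest user-set handler, never a stale one captured before the disconnect:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical stand-ins for the real Hedwig types.
    interface MessageHandler { void deliver(byte[] message); }
    interface DeliveryChannel { void setMessageHandler(MessageHandler h); }

    class Subscriber {
        private final Map<String, MessageHandler> handlers = new ConcurrentHashMap<>();
        private final Map<String, DeliveryChannel> channels = new ConcurrentHashMap<>();

        // Called by the user; the latest handler always wins.
        public void startDelivery(String topic, MessageHandler h) {
            handlers.put(topic, h);
            channels.get(topic).setMessageHandler(h);
        }

        // Package-private: the reconnect callback calls this instead of
        // startDelivery, so it cannot reinstall a handler it captured before
        // the disconnect; it only re-applies whatever the user set last.
        void restartDelivery(String topic) {
            MessageHandler h = handlers.get(topic);
            if (h != null) {
                channels.get(topic).setMessageHandler(h);
            }
        }
    }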