ZooKeeper-trunk - Build # 2731 - Still Failing
See https://builds.apache.org/job/ZooKeeper-trunk/2731/

### LAST 60 LINES OF THE CONSOLE ###
[...truncated 376353 lines...]
[junit] at java.lang.Object.wait(Native Method)
[junit] at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:559)
[junit] at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1036)
[junit] 2015-06-19 20:30:11,577 [myid:] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):Leader@613] - Shutting down
[junit] 2015-06-19 20:30:11,577 [myid:] - WARN [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):QuorumPeer@1070] - PeerState set to LOOKING
[junit] 2015-06-19 20:30:11,578 [myid:] - WARN [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):QuorumPeer@1052] - QuorumPeer main thread exited
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id4]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.4]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.1]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.2]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.3]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO [QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.5]
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO [main:QuorumUtil@254] - Shutting down leader election QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled)
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO [main:QuorumUtil@259] - Waiting for QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled) to exit thread
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO [main:QuorumUtil@250] - Shutting down quorum peer QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:11363)(secure=disabled)
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO [main:QuorumUtil@254] - Shutting down leader election QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:11363)(secure=disabled)
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO [main:QuorumUtil@259] - Waiting for QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:11363)(secure=disabled) to exit thread
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO [main:FourLetterWordMain@63] - connecting to 127.0.0.1 11351
[junit] 2015-06-19 20:30:11,580 [myid:] - INFO [main:QuorumUtil@243] - 127.0.0.1:11351 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,580 [myid:] - INFO [main:FourLetterWordMain@63] - connecting to 127.0.0.1 11354
[junit] 2015-06-19 20:30:11,580 [myid:] - INFO [main:QuorumUtil@243] - 127.0.0.1:11354 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,580 [myid:] - INFO [main:FourLetterWordMain@63] - connecting to 127.0.0.1 11357
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO [main:QuorumUtil@243] - 127.0.0.1:11357 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO [main:FourLetterWordMain@63] - connecting to 127.0.0.1 11360
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO [main:QuorumUtil@243] - 127.0.0.1:11360 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO [main:FourLetterWordMain@63] - connecting to 127.0.0.1 11363
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO [main:QuorumUtil@243] - 127.0.0.1:11363 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,582 [myid:] - INFO [main:ZKTestCase$1@65] - SUCCEEDED testRemoveAddTwo
[junit] 2015-06-19 20:30:11,582 [myid:] - INFO [main:ZKTestCase$1@60] - FINISHED testRemoveAddTwo
[junit] Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 115.056 sec, Thread: 1, Class: org.apache.zookeeper.test.ReconfigTest
[junit] 2015-06-19 20:30:11,645 [myid:] - INFO [main-SendThread(127.0.0.1:11324):ClientCnxn$SendThread@1138] - Opening socket connection to server 127.0.0.1/127.0.0.1:11324.
[jira] [Commented] (ZOOKEEPER-2185) Run server with -XX:+HeapDumpOnOutOfMemoryError and -XX:OnOutOfMemoryError='kill %p'.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593921#comment-14593921 ]

Hudson commented on ZOOKEEPER-2185:
---
FAILURE: Integrated in ZooKeeper-trunk #2731 (See [https://builds.apache.org/job/ZooKeeper-trunk/2731/])
ZOOKEEPER-2185: Run server with -XX:+HeapDumpOnOutOfMemoryError and -XX:OnOutOfMemoryError='kill %p' (Chris Nauroth via rgs) (rgs: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1686296)
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/bin/zkServer.cmd
* /zookeeper/trunk/bin/zkServer.sh
* /zookeeper/trunk/src/docs/src/documentation/content/xdocs/zookeeperAdmin.xml

Run server with -XX:+HeapDumpOnOutOfMemoryError and -XX:OnOutOfMemoryError='kill %p'.
-
Key: ZOOKEEPER-2185
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2185
Project: ZooKeeper
Issue Type: Improvement
Components: documentation, scripts
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Minor
Fix For: 3.5.1, 3.6.0
Attachments: ZOOKEEPER-2185.001.patch

Continuing to run a server process after it runs out of memory can lead to unexpected behavior. This issue proposes that we update scripts and documentation to add these JVM options:
# {{-XX:+HeapDumpOnOutOfMemoryError}} for help with post-mortem analysis of why the process ran out of memory.
# {{-XX:OnOutOfMemoryError='kill %p'}} to kill the JVM process, under the assumption that a process monitor will restart it.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
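
One way to confirm that a running server actually picked up the two options proposed in ZOOKEEPER-2185 is to inspect the JVM's own argument list. The following is only a sketch: the class `OomOptionCheck` and its method are hypothetical names and are not part of ZooKeeper or of the attached patch. It relies on the standard `RuntimeMXBean.getInputArguments()` call.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ZooKeeper: reports which of the two
// OOM-related options named in ZOOKEEPER-2185 are absent from a JVM's arguments.
class OomOptionCheck {
    static final String HEAP_DUMP = "-XX:+HeapDumpOnOutOfMemoryError";
    static final String ON_OOM_PREFIX = "-XX:OnOutOfMemoryError=";

    // Returns the options (or option prefixes) that do not appear in jvmArgs.
    static List<String> missingOptions(List<String> jvmArgs) {
        List<String> missing = new ArrayList<String>();
        if (!jvmArgs.contains(HEAP_DUMP)) {
            missing.add(HEAP_DUMP);
        }
        boolean hasOnOom = false;
        for (String arg : jvmArgs) {
            // The command after '=' varies (e.g. kill %p), so match on the prefix only.
            if (arg.startsWith(ON_OOM_PREFIX)) {
                hasOnOom = true;
            }
        }
        if (!hasOnOom) {
            missing.add(ON_OOM_PREFIX);
        }
        return missing;
    }

    public static void main(String[] args) {
        // Arguments the current JVM was started with.
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing OOM options: " + missingOptions(actual));
    }
}
```

Run inside the server JVM (or any JVM started by the same script), an empty list would suggest both options are in effect.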
[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594269#comment-14594269 ]

Hadoop QA commented on ZOOKEEPER-2193:
--
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12740593/ZOOKEEPER-2193-v7.patch
against trunk revision 1686296.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//console

This message is automatically generated.

reconfig command completes even if parameter is wrong obviously
---
Key: ZOOKEEPER-2193
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2193
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection, server
Affects Versions: 3.5.0
Environment: CentOS7 + Java7
Reporter: Yasuhito Fukuda
Assignee: Yasuhito Fukuda
Attachments: ZOOKEEPER-2193-v2.patch, ZOOKEEPER-2193-v3.patch, ZOOKEEPER-2193-v4.patch, ZOOKEEPER-2193-v5.patch, ZOOKEEPER-2193-v6.patch, ZOOKEEPER-2193-v7.patch, ZOOKEEPER-2193.patch

Even if a reconfig parameter is wrong, the command was confirmed to complete. Refer to the following.
- Ensemble consists of four nodes
{noformat}
[zk: vm-101:2181(CONNECTED) 0] config
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
version=1
{noformat}
- add node by reconfig command
{noformat}
[zk: vm-101:2181(CONNECTED) 9] reconfig -add server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
Committed new configuration:
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
version=30007
{noformat}
server.4 and server.5 have the same IP address and ports. In this state, leader election will not work properly, and the ensemble will presumably end up in an undesirable state.
I think parameter validation is needed on reconfig.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
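
The kind of validation the report asks for could look roughly like the sketch below. This is not the code from the attached patches; `ReconfigAddressCheck` and its methods are hypothetical names. It assumes server specs in the `server.N=host:quorumPort:electionPort[:role][;clientAddr]` form shown above, with IPv4 or hostname hosts, and it compares host strings literally, so two different names that resolve to the same machine would not be caught.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the ZOOKEEPER-2193 patch: flags server specs
// that reuse the same host, quorum port and election port.
class ReconfigAddressCheck {
    // Extracts "host:quorumPort:electionPort" from a spec such as
    // "server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181".
    static String quorumEndpoint(String serverSpec) {
        int eq = serverSpec.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("Missing '=' in server spec: " + serverSpec);
        }
        String value = serverSpec.substring(eq + 1);
        int semi = value.indexOf(';');
        if (semi >= 0) {
            value = value.substring(0, semi); // drop the client address part
        }
        String[] parts = value.split(":");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Malformed server spec: " + serverSpec);
        }
        return parts[0] + ":" + parts[1] + ":" + parts[2];
    }

    // Returns one message per spec whose endpoint was already used by an earlier spec.
    static List<String> findDuplicates(List<String> serverSpecs) {
        Map<String, String> ownerByEndpoint = new HashMap<String, String>();
        List<String> conflicts = new ArrayList<String>();
        for (String spec : serverSpecs) {
            String endpoint = quorumEndpoint(spec); // also validates that '=' exists
            String id = spec.substring(0, spec.indexOf('='));
            String previous = ownerByEndpoint.put(endpoint, id);
            if (previous != null) {
                conflicts.add(id + " duplicates " + previous + " at " + endpoint);
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        List<String> specs = Arrays.asList(
            "server.1=192.168.100.101:2888:3888:participant",
            "server.2=192.168.100.102:2888:3888:participant",
            "server.3=192.168.100.103:2888:3888:participant",
            "server.4=192.168.100.104:2888:3888:participant",
            "server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181");
        for (String conflict : findDuplicates(specs)) {
            System.out.println(conflict);
        }
    }
}
```

With the five specs from the report, the check reports server.5 as a duplicate of server.4; a server-side version would reject the reconfig instead of committing it.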
Failed: ZOOKEEPER-2193 PreCommit Build #2773
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2193
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773/

### LAST 60 LINES OF THE CONSOLE ###
[...truncated 372460 lines...]
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no new tests are needed for this patch.
[exec] Also please list what manual steps were performed to verify this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec]
[exec] +1 core tests. The patch passed core unit tests.
[exec]
[exec] +1 contrib tests. The patch passed contrib unit tests.
[exec]
[exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//testReport/
[exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
[exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//console
[exec]
[exec] This message is automatically generated.
[exec]
[exec]
[exec] ==
[exec] ==
[exec] Adding comment to Jira.
[exec] ==
[exec] ==
[exec]
[exec]
[exec] Comment added.
[exec] 170b00b3e2ae5bf46c50bdcac4e5684959087aa7 logged out
[exec]
[exec]
[exec] ==
[exec] ==
[exec] Finished build.
[exec] ==
[exec] ==
[exec]
[exec]

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build.xml:1782: exec returned: 1

Total time: 13 minutes 16 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-ZOOKEEPER-Build #2752
Archived 24 artifacts
Archive block size is 32768
Received 5 blocks and 33436439 bytes
Compression is 0.5%
Took 12 sec
Recording test results
Description set: ZOOKEEPER-2193
Email was triggered for: Failure
Sending email for trigger: Failure

### FAILED TESTS (if any) ###
All tests passed
[jira] [Updated] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yasuhito Fukuda updated ZOOKEEPER-2193:
---
Attachment: ZOOKEEPER-2193-v7.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593142#comment-14593142 ]

Yasuhito Fukuda commented on ZOOKEEPER-2193:
---
I am sorry. I did not check the unit-test results carefully.
The 3 failures were caused by getAddress() returning null in excludedSpecialAddress() when a hostname could not be resolved by DNS or an address was wrong.
I've attached a new patch, and I've posted a new diff on the review board.
https://reviews.apache.org/r/35204/diff/3-4/

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
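
The null return mentioned in the comment above matches documented JDK behavior: `InetSocketAddress.getAddress()` returns null for an unresolved address. The sketch below illustrates that and one possible null-safe fallback. `UnresolvedAddressDemo` and `describe` are hypothetical names, and this is not the logic of `excludedSpecialAddress()` in the patch, which is not shown in this thread.

```java
import java.net.InetSocketAddress;

// Hypothetical demo of the JDK behavior behind the test failures:
// getAddress() is null when the host name was not resolved.
class UnresolvedAddressDemo {
    // Returns the IP literal when resolved, otherwise the host string with a marker.
    static String describe(InetSocketAddress addr) {
        if (addr.getAddress() == null) {
            // Unresolved: no InetAddress is available, so fall back to the host string.
            return addr.getHostString() + " (unresolved)";
        }
        return addr.getAddress().getHostAddress();
    }

    public static void main(String[] args) {
        // createUnresolved never performs a DNS lookup.
        InetSocketAddress unresolved =
            InetSocketAddress.createUnresolved("no-such-host.invalid", 2888);
        System.out.println(describe(unresolved));
        // An IP literal is resolved without a DNS query.
        System.out.println(describe(new InetSocketAddress("127.0.0.1", 2888)));
    }
}
```

Any validation that compares addresses would need a guard like the null check above before calling methods on the result of `getAddress()`.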
Re: Review Request 35204: ZOOKEEPER-2193: reconfig command completes even if parameter is wrong obviously
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35204/
---

(Updated June 19, 2015, 4:33 p.m.)

Review request for zookeeper.

Bugs: ZOOKEEPER-2193
https://issues.apache.org/jira/browse/ZOOKEEPER-2193

Repository: zookeeper-git

Description
---
See ZOOKEEPER-2193

Diffs (updated)
-
src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java eb045de19c9eeb632e5f2b98c5465abcaead7740
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java f15f831701f9c8514db5003ebd550cd3880b48c7

Diff: https://reviews.apache.org/r/35204/diff/

Testing
---

Thanks,
Yasuhito Fukuda
[jira] [Commented] (ZOOKEEPER-776) API should sanity check sessionTimeout argument
[ https://issues.apache.org/jira/browse/ZOOKEEPER-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593423#comment-14593423 ]

Bill Havanki commented on ZOOKEEPER-776:
---
No worries! I've been working on other things, so I haven't been blocked at all. Thanks for taking this issue back up!

API should sanity check sessionTimeout argument
---
Key: ZOOKEEPER-776
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-776
Project: ZooKeeper
Issue Type: Improvement
Components: c client, java client
Affects Versions: 3.2.2, 3.3.0, 3.3.1, 3.4.6, 3.5.0
Environment: OSX 10.6.3, JVM 1.6.0-20
Reporter: Gregory Haskins
Assignee: Raul Gutierrez Segales
Priority: Minor
Fix For: 3.5.2, 3.6.0
Attachments: ZOOKEEPER-776.patch, zookeeper-776-fix.patch, zookeeper-776-fix.patch, zookeeper-776-fix.patch

Passing in a 0 sessionTimeout to ZooKeeper() constructor leads to errors in subsequent operations. It would be ideal to capture this configuration error at the source by throwing something like an IllegalArgument exception when the bogus sessionTimeout is specified, instead of later when it is utilized.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
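
The fail-fast check requested in ZOOKEEPER-776 could be as small as the sketch below. `SessionTimeoutCheck` and `requirePositiveTimeout` are hypothetical names, not the attached patch or the ZooKeeper client API. Treating negative values as invalid too is an assumption here, since the report only mentions 0.

```java
// Hypothetical sketch of an up-front sessionTimeout check; not the ZOOKEEPER-776 patch.
class SessionTimeoutCheck {
    // Returns the timeout unchanged when it is usable, otherwise fails immediately.
    // Assumption: negative values are rejected as well as 0.
    static int requirePositiveTimeout(int sessionTimeoutMs) {
        if (sessionTimeoutMs <= 0) {
            throw new IllegalArgumentException(
                "sessionTimeout must be positive, got " + sessionTimeoutMs);
        }
        return sessionTimeoutMs;
    }

    public static void main(String[] args) {
        System.out.println(requirePositiveTimeout(30000));
        try {
            requirePositiveTimeout(0);
        } catch (IllegalArgumentException e) {
            // The error surfaces at construction time rather than in a later operation.
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Calling such a check at the top of a constructor moves the failure to the place where the bad value was supplied, which is the behavior the report asks for.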
Review Request 35643: ZOOKEEPER-2193: reconfig command completes even if parameter is wrong obviously
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35643/
---

Review request for zookeeper.

Repository: zookeeper-git

Description
---
See ZOOKEEPER-2193

Diffs
-
src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java eb045de19c9eeb632e5f2b98c5465abcaead7740
src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java f15f831701f9c8514db5003ebd550cd3880b48c7

Diff: https://reviews.apache.org/r/35643/diff/

Testing
---

Thanks,
Yasuhito Fukuda
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593156#comment-14593156 ]

Akihiro Suda commented on ZOOKEEPER-2172:
-
Hi,

It is interesting that servers 1 and 2 both time out at 15:55:08,439, after the reconfig began at 15:55:04. After these timeouts, neither {{ZooKeeperServer}} can be revived and the ensemble ends up in a weird state.
(However, in zoo-3-2.log (Jun 3), server 2 raises {{EOFException}}, not {{SocketTimeoutException}}, at 17:15:31.)

These timeouts are raised by [this while loop|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L515] in server 1 and [this while loop|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L89] in server 2.

Unfortunately, we are not sure which types of QuorumPacket are triggering these timeouts. So I think it might be helpful to add {{LOG.debug(qp.getType())}} at [this switch|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L532] for server 1 and [this switch|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L114] for server 2.

Perhaps they are not pinging each other? [This|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L924-925] comment in {{LearnerHandler.ping()}} seems interesting.
{panel}
// If learner hasn't sync properly yet, don't send ping packet
// otherwise, the learner will crash
{panel}

Cluster crashes when reconfig a new node as a participant
-
Key: ZOOKEEPER-2172
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection, quorum, server
Affects Versions: 3.5.0
Environment: Ubuntu 12.04 + java 7
Reporter: Ziyou Wang
Priority: Critical
Attachments: node-1.log, node-2.log, node-3.log, zoo-1.log, zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, zoo-3.log, zoo.cfg.dynamic.1005d, zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-2.log, zookeeper-3.log

The operations are quite simple: start three zk servers one by one, then reconfig the cluster to add the new one as a participant. When I add the third one, the zk cluster may enter a weird state and cannot recover.

I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log. So the first node received the reconfig cmd at 12:53:48. Later, it logged “2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - *** GOODBYE /10.0.0.2:55890 ”. From then on, the first node and second node rejected all client connections and the third node didn’t join the cluster as a participant. The whole cluster was down.

When the problem happened, all three nodes were using the same dynamic config file zoo.cfg.dynamic.1005d, which only contained the first two nodes. But there was another, unused dynamic config file in the node-1 directory, zoo.cfg.dynamic.next, which already contained three nodes.

When I extended the waiting time between starting the third node and reconfiguring the cluster, the problem didn’t show again. So it should be a race condition problem.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
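
The debug logging suggested in the comment above could be sketched as follows. Everything here is a stand-in: `PacketTypeLogging` and its nested `Packet` class are hypothetical and only mimic a packet with a `getType()` accessor, and `java.util.logging` is used so the example runs without extra libraries. In ZooKeeper's own code, which logs through slf4j, the integer type would need to be formatted into a message string, because slf4j's `debug` methods take a `String` message rather than an `int`.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: log the packet type before dispatching on it,
// so a later timeout can be correlated with the last packet seen.
class PacketTypeLogging {
    private static final Logger LOG = Logger.getLogger("PacketTypeLogging");

    // Stand-in for a quorum packet; only the type field matters here.
    static final class Packet {
        private final int type;
        Packet(int type) { this.type = type; }
        int getType() { return type; }
    }

    // Builds the message separately so it can be checked without a log handler.
    static String debugLine(Packet qp) {
        return "Received packet type " + qp.getType();
    }

    static void handle(Packet qp) {
        // Guard avoids building the string when debug-level logging is off.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(debugLine(qp));
        }
        switch (qp.getType()) {
            default:
                // Real handling of each packet type would go here.
                break;
        }
    }

    public static void main(String[] args) {
        handle(new Packet(5));
        System.out.println(debugLine(new Packet(5)));
    }
}
```

Logging immediately before the switch means the last line written before a timeout identifies the packet type that was being processed, which is the information the comment says is missing.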