ZooKeeper-trunk - Build # 2731 - Still Failing

2015-06-19 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper-trunk/2731/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 376353 lines...]
[junit] at java.lang.Object.wait(Native Method)
[junit] at 
org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:559)
[junit] at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1036)
[junit] 2015-06-19 20:30:11,577 [myid:] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):Leader@613] 
- Shutting down
[junit] 2015-06-19 20:30:11,577 [myid:] - WARN  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):QuorumPeer@1070]
 - PeerState set to LOOKING
[junit] 2015-06-19 20:30:11,578 [myid:] - WARN  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):QuorumPeer@1052]
 - QuorumPeer main thread exited
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119]
 - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id4]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119]
 - Unregister MBean 
[org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.4]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119]
 - Unregister MBean 
[org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.1]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119]
 - Unregister MBean 
[org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.2]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119]
 - Unregister MBean 
[org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.3]
[junit] 2015-06-19 20:30:11,578 [myid:] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled):MBeanRegistry@119]
 - Unregister MBean 
[org.apache.ZooKeeperService:name0=ReplicatedServer_id4,name1=replica.5]
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO  [main:QuorumUtil@254] - 
Shutting down leader election 
QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled)
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO  [main:QuorumUtil@259] - 
Waiting for QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:11360)(secure=disabled) 
to exit thread
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO  [main:QuorumUtil@250] - 
Shutting down quorum peer 
QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:11363)(secure=disabled)
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO  [main:QuorumUtil@254] - 
Shutting down leader election 
QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:11363)(secure=disabled)
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO  [main:QuorumUtil@259] - 
Waiting for QuorumPeer[myid=5](plain=/0:0:0:0:0:0:0:0:11363)(secure=disabled) 
to exit thread
[junit] 2015-06-19 20:30:11,579 [myid:] - INFO  
[main:FourLetterWordMain@63] - connecting to 127.0.0.1 11351
[junit] 2015-06-19 20:30:11,580 [myid:] - INFO  [main:QuorumUtil@243] - 
127.0.0.1:11351 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,580 [myid:] - INFO  
[main:FourLetterWordMain@63] - connecting to 127.0.0.1 11354
[junit] 2015-06-19 20:30:11,580 [myid:] - INFO  [main:QuorumUtil@243] - 
127.0.0.1:11354 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,580 [myid:] - INFO  
[main:FourLetterWordMain@63] - connecting to 127.0.0.1 11357
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO  [main:QuorumUtil@243] - 
127.0.0.1:11357 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO  
[main:FourLetterWordMain@63] - connecting to 127.0.0.1 11360
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO  [main:QuorumUtil@243] - 
127.0.0.1:11360 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO  
[main:FourLetterWordMain@63] - connecting to 127.0.0.1 11363
[junit] 2015-06-19 20:30:11,581 [myid:] - INFO  [main:QuorumUtil@243] - 
127.0.0.1:11363 is no longer accepting client connections
[junit] 2015-06-19 20:30:11,582 [myid:] - INFO  [main:ZKTestCase$1@65] - 
SUCCEEDED testRemoveAddTwo
[junit] 2015-06-19 20:30:11,582 [myid:] - INFO  [main:ZKTestCase$1@60] - 
FINISHED testRemoveAddTwo
[junit] Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
115.056 sec, Thread: 1, Class: org.apache.zookeeper.test.ReconfigTest
[junit] 2015-06-19 20:30:11,645 [myid:] - INFO  
[main-SendThread(127.0.0.1:11324):ClientCnxn$SendThread@1138] - Opening socket 
connection to server 127.0.0.1/127.0.0.1:11324. 

[jira] [Commented] (ZOOKEEPER-2185) Run server with -XX:+HeapDumpOnOutOfMemoryError and -XX:OnOutOfMemoryError='kill %p'.

2015-06-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593921#comment-14593921
 ] 

Hudson commented on ZOOKEEPER-2185:
---

FAILURE: Integrated in ZooKeeper-trunk #2731 (See 
[https://builds.apache.org/job/ZooKeeper-trunk/2731/])
ZOOKEEPER-2185: Run server with -XX:+HeapDumpOnOutOfMemoryError and
-XX:OnOutOfMemoryError='kill %p' (Chris Nauroth via rgs) (rgs: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1686296)
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/bin/zkServer.cmd
* /zookeeper/trunk/bin/zkServer.sh
* /zookeeper/trunk/src/docs/src/documentation/content/xdocs/zookeeperAdmin.xml


 Run server with -XX:+HeapDumpOnOutOfMemoryError and 
 -XX:OnOutOfMemoryError='kill %p'.
 -

 Key: ZOOKEEPER-2185
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2185
 Project: ZooKeeper
  Issue Type: Improvement
  Components: documentation, scripts
Reporter: Chris Nauroth
Assignee: Chris Nauroth
Priority: Minor
 Fix For: 3.5.1, 3.6.0

 Attachments: ZOOKEEPER-2185.001.patch


 Continuing to run a server process after it runs out of memory can lead to 
 unexpected behavior.  This issue proposes that we update scripts and 
 documentation to add these JVM options:
 # {{-XX:+HeapDumpOnOutOfMemoryError}} for help with post-mortem analysis of 
 why the process ran out of memory.
 # {{-XX:OnOutOfMemoryError='kill %p'}} to kill the JVM process, under the 
 assumption that a process monitor will restart it.
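
A minimal sketch, not part of the patch, of how a Java process could confirm at startup that it was launched with the two options above. The class name OomFlagCheck and the helper hasOption are hypothetical; the only library call used is RuntimeMXBean.getInputArguments(), which returns the arguments passed to the JVM.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper for illustration only; not taken from ZOOKEEPER-2185.
final class OomFlagCheck {

    // Returns true if any JVM argument starts with the given prefix.
    static boolean hasOption(List<String> jvmArgs, String prefix) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments the running JVM was started with (excludes main() arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("HeapDumpOnOutOfMemoryError set: "
                + hasOption(jvmArgs, "-XX:+HeapDumpOnOutOfMemoryError"));
        System.out.println("OnOutOfMemoryError set: "
                + hasOption(jvmArgs, "-XX:OnOutOfMemoryError="));
    }
}
```

The prefix match on "-XX:OnOutOfMemoryError=" is deliberate: the command after the equals sign may differ between deployments.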



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously

2015-06-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594269#comment-14594269
 ] 

Hadoop QA commented on ZOOKEEPER-2193:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12740593/ZOOKEEPER-2193-v7.patch
  against trunk revision 1686296.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//console

This message is automatically generated.

 reconfig command completes even if parameter is wrong obviously
 ---

 Key: ZOOKEEPER-2193
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2193
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection, server
Affects Versions: 3.5.0
 Environment: CentOS7 + Java7
Reporter: Yasuhito Fukuda
Assignee: Yasuhito Fukuda
 Attachments: ZOOKEEPER-2193-v2.patch, ZOOKEEPER-2193-v3.patch, 
 ZOOKEEPER-2193-v4.patch, ZOOKEEPER-2193-v5.patch, ZOOKEEPER-2193-v6.patch, 
 ZOOKEEPER-2193-v7.patch, ZOOKEEPER-2193.patch


 The reconfig command was confirmed to complete even when its parameter is wrong.
 Refer to the following.
 - Ensemble consists of four nodes
 {noformat}
 [zk: vm-101:2181(CONNECTED) 0] config
 server.1=192.168.100.101:2888:3888:participant
 server.2=192.168.100.102:2888:3888:participant
 server.3=192.168.100.103:2888:3888:participant
 server.4=192.168.100.104:2888:3888:participant
 version=1
 {noformat}
 - add node by reconfig command
 {noformat}
 [zk: vm-101:2181(CONNECTED) 9] reconfig -add 
 server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
 Committed new configuration:
 server.1=192.168.100.101:2888:3888:participant
 server.2=192.168.100.102:2888:3888:participant
 server.3=192.168.100.103:2888:3888:participant
 server.4=192.168.100.104:2888:3888:participant
 server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
 version=30007
 {noformat}
 server.4 and server.5 have the same IP address.
 In this state, leader election will not work properly,
 and the ensemble is likely to end up in an undesirable state.
 I think parameter validation is needed when running reconfig.
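
The kind of validation requested above might look like the following sketch. It is not the code from the attached patches: the class ReconfigAddressCheck and both method names are hypothetical, and the parsing assumes IPv4 addresses or hostnames in the server.N=host:quorumPort:electionPort:role[;clientAddress] form shown in the report.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the implementation in the ZOOKEEPER-2193 patches.
final class ReconfigAddressCheck {

    // Extracts "host:quorumPort:electionPort" from a spec such as
    // "server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181".
    // Assumes IPv4 or hostname; IPv6 literals would need different parsing.
    static String quorumEndpoint(String spec) {
        String value = spec.substring(spec.indexOf('=') + 1);
        int semi = value.indexOf(';');
        if (semi >= 0) {
            value = value.substring(0, semi); // drop the client address part
        }
        String[] parts = value.split(":");
        if (parts.length < 3) {
            throw new IllegalArgumentException("Malformed server spec: " + spec);
        }
        return parts[0] + ":" + parts[1] + ":" + parts[2];
    }

    // Rejects a proposed configuration in which two servers share an endpoint.
    static void checkNoDuplicates(List<String> specs) {
        Map<String, String> seen = new HashMap<>();
        for (String spec : specs) {
            String previous = seen.put(quorumEndpoint(spec), spec);
            if (previous != null) {
                throw new IllegalArgumentException(
                        "Duplicate quorum address in '" + previous + "' and '" + spec + "'");
            }
        }
    }
}
```

With the configuration from this report, checkNoDuplicates would reject the request because server.4 and server.5 both resolve to 192.168.100.104:2888:3888.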





Failed: ZOOKEEPER-2193 PreCommit Build #2773

2015-06-19 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2193
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 372460 lines...]
 [exec] 
 [exec] -1 tests included.  The patch doesn't appear to include any new 
or modified tests.
 [exec] Please justify why no new tests are needed 
for this patch.
 [exec] Also please list what manual steps were 
performed to verify this patch.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
(version 2.0.3) warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] +1 core tests.  The patch passed core unit tests.
 [exec] 
 [exec] +1 contrib tests.  The patch passed contrib unit tests.
 [exec] 
 [exec] Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//testReport/
 [exec] Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
 [exec] Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2773//console
 [exec] 
 [exec] This message is automatically generated.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Adding comment to Jira.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] Comment added.
 [exec] 170b00b3e2ae5bf46c50bdcac4e5684959087aa7 logged out
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Finished build.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/build.xml:1782:
 exec returned: 1

Total time: 13 minutes 16 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-ZOOKEEPER-Build #2752
Archived 24 artifacts
Archive block size is 32768
Received 5 blocks and 33436439 bytes
Compression is 0.5%
Took 12 sec
Recording test results
Description set: ZOOKEEPER-2193
Email was triggered for: Failure
Sending email for trigger: Failure



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Updated] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously

2015-06-19 Thread Yasuhito Fukuda (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yasuhito Fukuda updated ZOOKEEPER-2193:
---
Attachment: ZOOKEEPER-2193-v7.patch



[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously

2015-06-19 Thread Yasuhito Fukuda (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593142#comment-14593142
 ] 

Yasuhito Fukuda commented on ZOOKEEPER-2193:


I am sorry. I did not check the unit-test results carefully.
The 3 failures were caused by getAddress() returning null in 
excludedSpecialAddress() when a hostname could not be resolved by DNS or an 
address was wrong.
I've attached a new patch, and I've posted a new diff on the review board.
https://reviews.apache.org/r/35204/diff/3-4/
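
For context on the null return described above: java.net.InetSocketAddress.getAddress() returns null when the socket address is unresolved. The sketch below shows one way to guard against that; the class UnresolvedAddressDemo and the method describe are hypothetical and are not taken from the patch.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Hypothetical illustration of a null guard; not code from the patch.
final class UnresolvedAddressDemo {

    // Returns the IP literal when resolved, or a marker plus the host string when not.
    static String describe(InetSocketAddress addr) {
        InetAddress ip = addr.getAddress(); // null when the hostname was not resolved
        if (ip == null) {
            return "unresolved:" + addr.getHostString();
        }
        return ip.getHostAddress();
    }

    public static void main(String[] args) {
        // createUnresolved() skips the DNS lookup, so getAddress() is null here.
        InetSocketAddress unresolved =
                InetSocketAddress.createUnresolved("no-such-host.invalid", 2888);
        System.out.println(describe(unresolved));
    }
}
```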



Re: Review Request 35204: ZOOKEEPER-2193: reconfig command completes even if parameter is wrong obviously

2015-06-19 Thread Yasuhito Fukuda

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35204/
---

(Updated June 19, 2015, 4:33 p.m.)


Review request for zookeeper.


Bugs: ZOOKEEPER-2193
https://issues.apache.org/jira/browse/ZOOKEEPER-2193


Repository: zookeeper-git


Description
---

See ZOOKEEPER-2193


Diffs (updated)
-

  src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java 
eb045de19c9eeb632e5f2b98c5465abcaead7740 
  src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java 
f15f831701f9c8514db5003ebd550cd3880b48c7 

Diff: https://reviews.apache.org/r/35204/diff/


Testing
---


Thanks,

Yasuhito Fukuda



[jira] [Commented] (ZOOKEEPER-776) API should sanity check sessionTimeout argument

2015-06-19 Thread Bill Havanki (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593423#comment-14593423
 ] 

Bill Havanki commented on ZOOKEEPER-776:


No worries! I've been working on other things, so I haven't been blocked at 
all. Thanks for taking this issue back up!

 API should sanity check sessionTimeout argument
 ---

 Key: ZOOKEEPER-776
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-776
 Project: ZooKeeper
  Issue Type: Improvement
  Components: c client, java client
Affects Versions: 3.2.2, 3.3.0, 3.3.1, 3.4.6, 3.5.0
 Environment: OSX 10.6.3, JVM 1.6.0-20
Reporter: Gregory Haskins
Assignee: Raul Gutierrez Segales
Priority: Minor
 Fix For: 3.5.2, 3.6.0

 Attachments: ZOOKEEPER-776.patch, zookeeper-776-fix.patch, 
 zookeeper-776-fix.patch, zookeeper-776-fix.patch


 passing in a 0 sessionTimeout to ZooKeeper() constructor leads to errors in 
 subsequent operations.  It would be ideal to capture this configuration error 
 at the source by throwing something like an IllegalArgument exception when 
 the bogus sessionTimeout is specified, instead of later when it is utilized.
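
A fail-fast check of the kind described above could be sketched as follows. This is only an illustration: the class SessionTimeoutCheck and the method validateSessionTimeout are hypothetical, and rejecting negative values in addition to 0 is an assumption that goes beyond what the report states.

```java
// Hypothetical illustration; not the code from the attached patches.
final class SessionTimeoutCheck {

    // Returns the timeout unchanged when it is usable, otherwise fails at the call site.
    // Assumption: negative values are treated as invalid, just like 0.
    static int validateSessionTimeout(int sessionTimeoutMs) {
        if (sessionTimeoutMs <= 0) {
            throw new IllegalArgumentException(
                    "sessionTimeout must be positive, got " + sessionTimeoutMs);
        }
        return sessionTimeoutMs;
    }

    public static void main(String[] args) {
        System.out.println(validateSessionTimeout(30000));
    }
}
```

Calling such a check at the start of a constructor surfaces the configuration error where it is made, rather than in a later operation.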





Review Request 35643: ZOOKEEPER-2193: reconfig command completes even if parameter is wrong obviously

2015-06-19 Thread Yasuhito Fukuda

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35643/
---

Review request for zookeeper.


Repository: zookeeper-git


Description
---

See ZOOKEEPER-2193


Diffs
-

  src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java 
eb045de19c9eeb632e5f2b98c5465abcaead7740 
  src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java 
f15f831701f9c8514db5003ebd550cd3880b48c7 

Diff: https://reviews.apache.org/r/35643/diff/


Testing
---


Thanks,

Yasuhito Fukuda



[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant

2015-06-19 Thread Akihiro Suda (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593156#comment-14593156
 ] 

Akihiro Suda commented on ZOOKEEPER-2172:
-

Hi,

It is interesting that servers 1 and 2 both time out at 15:55:08,439, after 
the reconfig began at 15:55:04.
After these timeouts, neither {{ZooKeeperServer}} can be revived and the 
ensemble gets into a weird state.
(However, in zoo-3-2.log (Jun 3), server 2 raises {{EOFException}}, not 
{{SocketTimeoutException}}, at 17:15:31.)

These timeouts are raised by [this while 
loop|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L515]
 in server 1 and [this while 
loop|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L89]
 in server 2.

Unfortunately, we are not sure which types of QuorumPacket are triggering these 
timeouts.
So I think it might be helpful to add {{LOG.debug(qp.getType())}} at [this 
switch|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L532]
 for server 1 and [this 
switch|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L114]
 for server 2.

Perhaps they are not pinging each other?
[This|https://github.com/apache/zookeeper/blob/77e46cad03d64530ea53be53f5e38e8f1e7e8eee/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java#L924-925]
 comment in {{LearnerHandler.ping()}} seems interesting.
{panel}
// If learner hasn't sync properly yet, don't send ping packet
// otherwise, the learner will crash
{panel}


 Cluster crashes when reconfig a new node as a participant
 -

 Key: ZOOKEEPER-2172
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
 Project: ZooKeeper
  Issue Type: Bug
  Components: leaderElection, quorum, server
Affects Versions: 3.5.0
 Environment: Ubuntu 12.04 + java 7
Reporter: Ziyou Wang
Priority: Critical
 Attachments: node-1.log, node-2.log, node-3.log, zoo-1.log, 
 zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, 
 zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, 
 zoo-3.log, zoo.cfg.dynamic.1005d, zoo.cfg.dynamic.next, zookeeper-1.log, 
 zookeeper-2.log, zookeeper-3.log


 The operations are quite simple: start three zk servers one by one, then 
 reconfig the cluster to add the new one as a participant. When I add the  
 third one, the zk cluster may enter a weird state and cannot recover.
  
   I found “2015-04-20 12:53:48,236 [myid:1] - INFO  [ProcessThread(sid:1 
 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in node-1 log. 
 So the first node received the reconfig cmd at 12:53:48. Later, it logged 
 “2015-04-20  12:53:52,230 [myid:1] - ERROR 
 [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception 
 causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] 
 - WARN  [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - *** GOODBYE 
  /10.0.0.2:55890 ”. From then on, the first node and second node 
 rejected all client connections and the third node didn’t join the cluster as 
 a participant. The whole cluster was down.
  
  When the problem happened, all three nodes just used the same dynamic 
 config file zoo.cfg.dynamic.1005d which only contained the first two 
 nodes. But there was another unused dynamic config file in node-1 directory 
 zoo.cfg.dynamic.next  which already contained three nodes.
  
  When I extended the waiting time between starting the third node and 
 reconfiguring the cluster, the problem didn’t show again. So it should be a 
 race condition problem.


