[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636159#comment-14636159 ]
Alexander Shraer commented on ZOOKEEPER-2172: --------------------------------------------- It looks like server 2 is crashing before the first reconfig is invoked (18:14:48 is when the reconfig happens). This is what happens on server 2: INFO: paramMap: {quorumPacket=PING 200000001 null} quorum packet send in Follower : org.jboss.byteman.rule.exception.ExecuteException: MethodExpression.interpret : exception invoking method packetToString file /earthquake/reconfig-trunk/materials/server.btm line 34 2015-07-21 18:14:43,735 [myid:2] - ERROR [SyncThread:2:ZooKeeperCriticalThread@48] - Severe unrecoverable error, from thread : SyncThread:2 org.jboss.byteman.rule.exception.ExecuteException: MethodExpression.interpret : exception invoking method packetToString file /earthquake/reconfig-trunk/materials/server.btm line 34 at org.jboss.byteman.rule.expression.MethodExpression.interpret(MethodExpression.java:347) at org.jboss.byteman.rule.expression.MethodExpression.interpret(MethodExpression.java:334) at org.jboss.byteman.rule.Action.interpret(Action.java:144) at net.osrg.earthquake.PBEQHelper_HelperAdapter_Interpreted_2.fire(server.btm) at net.osrg.earthquake.PBEQHelper_HelperAdapter_Interpreted_2.execute0(server.btm) at net.osrg.earthquake.PBEQHelper_HelperAdapter_Interpreted_2.execute(server.btm) at org.jboss.byteman.rule.Rule.execute(Rule.java:684) at org.jboss.byteman.rule.Rule.execute(Rule.java:653) at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java) at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:62) at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:186) at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113) Caused by: java.lang.NullPointerException at org.apache.zookeeper.server.quorum.LearnerHandler.packetToString(LearnerHandler.java:261) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.jboss.byteman.rule.expression.MethodExpression.interpret(MethodExpression.java:341) ... 11 more 2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:ZooKeeperServer$ZooKeeperServerListenerImpl@442] - Thread SyncThread:2 exits, error code 1 2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:LearnerZooKeeperServer@165] - Shutting down 2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:ZooKeeperServer@465] - shutting down 2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:FollowerRequestProcessor@138] - Shutting down 2015-07-21 18:14:43,735 [myid:2] - INFO [SyncThread:2:CommitProcessor@358] - Shutting down 2015-07-21 18:14:43,736 [myid:2] - INFO [FollowerRequestProcessor:2:FollowerRequestProcessor@109] - FollowerRequestProcessor exited loop! then server 1, the leader, is sending an incremental reconfig and timing out and closing the connection. 2015-07-21 18:14:48,750 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@512] - Incremental reconfig Jul 21, 2015 6:14:48 PM net.osrg.earthquake.PBInspector EventFuncReturn INFO: paramMap: {quorumPacket=PROPOSAL 200000002 null} ..... 2015-07-21 18:15:02,739 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):LearnerHandler@937] - Closing connection to peer due to transaction timeout. > Cluster crashes when reconfig a new node as a participant > --------------------------------------------------------- > > Key: ZOOKEEPER-2172 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172 > Project: ZooKeeper > Issue Type: Bug > Components: leaderElection, quorum, server > Affects Versions: 3.5.0 > Environment: Ubuntu 12.04 + java 7 > Reporter: Ziyou Wang > Priority: Critical > Attachments: history.txt, node-1.log, node-2.log, node-3.log, > zoo-1.log, zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, > zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, > zoo-3.log, zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, zoo.cfg.dynamic.10000005d, > zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-1.out, zookeeper-2.log, > zookeeper-2.out, zookeeper-3.log, zookeeper-3.out > > > The operations are quite simple: start three zk servers one by one, then > reconfig the cluster to add the new one as a participant. When I add the > third one, the zk cluster may enter a weird state and cannot recover. > > I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 > cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in node-1 log. > So the first node received the reconfig cmd at 12:53:48. Latter, it logged > “2015-04-20 12:53:52,230 [myid:1] - ERROR > [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception > causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] > - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE > /10.0.0.2:55890 ********”. From then on, the first node and second node > rejected all client connections and the third node didn’t join the cluster as > a participant. The whole cluster was done. > > When the problem happened, all three nodes just used the same dynamic > config file zoo.cfg.dynamic.10000005d which only contained the first two > nodes. But there was another unused dynamic config file in node-1 directory > zoo.cfg.dynamic.next which already contained three nodes. > > When I extended the waiting time between starting the third node and > reconfiguring the cluster, the problem didn’t show again. So it should be a > race condition problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)