[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636159#comment-14636159 ]

Alexander Shraer commented on ZOOKEEPER-2172:
---------------------------------------------

It looks like server 2 is crashing before the first reconfig is invoked 
(18:14:48 is when the reconfig happens).

This is what happens on server 2:

INFO: paramMap: {quorumPacket=PING 200000001 null}
quorum packet send in Follower : 
org.jboss.byteman.rule.exception.ExecuteException: MethodExpression.interpret : 
exception invoking method packetToString file 
/earthquake/reconfig-trunk/materials/server.btm line 34
2015-07-21 18:14:43,735 [myid:2] - ERROR 
[SyncThread:2:ZooKeeperCriticalThread@48] - Severe unrecoverable error, from 
thread : SyncThread:2
org.jboss.byteman.rule.exception.ExecuteException: MethodExpression.interpret : 
exception invoking method packetToString file 
/earthquake/reconfig-trunk/materials/server.btm line 34
        at 
org.jboss.byteman.rule.expression.MethodExpression.interpret(MethodExpression.java:347)
        at 
org.jboss.byteman.rule.expression.MethodExpression.interpret(MethodExpression.java:334)
        at org.jboss.byteman.rule.Action.interpret(Action.java:144)
        at 
net.osrg.earthquake.PBEQHelper_HelperAdapter_Interpreted_2.fire(server.btm)
        at 
net.osrg.earthquake.PBEQHelper_HelperAdapter_Interpreted_2.execute0(server.btm)
        at 
net.osrg.earthquake.PBEQHelper_HelperAdapter_Interpreted_2.execute(server.btm)
        at org.jboss.byteman.rule.Rule.execute(Rule.java:684)
        at org.jboss.byteman.rule.Rule.execute(Rule.java:653)
        at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java)
        at 
org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:62)
        at 
org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:186)
        at 
org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
Caused by: java.lang.NullPointerException
        at 
org.apache.zookeeper.server.quorum.LearnerHandler.packetToString(LearnerHandler.java:261)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at 
org.jboss.byteman.rule.expression.MethodExpression.interpret(MethodExpression.java:341)
        ... 11 more
2015-07-21 18:14:43,735 [myid:2] - INFO  
[SyncThread:2:ZooKeeperServer$ZooKeeperServerListenerImpl@442] - Thread 
SyncThread:2 exits, error code 1
2015-07-21 18:14:43,735 [myid:2] - INFO  
[SyncThread:2:LearnerZooKeeperServer@165] - Shutting down
2015-07-21 18:14:43,735 [myid:2] - INFO  [SyncThread:2:ZooKeeperServer@465] - 
shutting down
2015-07-21 18:14:43,735 [myid:2] - INFO  
[SyncThread:2:FollowerRequestProcessor@138] - Shutting down
2015-07-21 18:14:43,735 [myid:2] - INFO  [SyncThread:2:CommitProcessor@358] - 
Shutting down
2015-07-21 18:14:43,736 [myid:2] - INFO  
[FollowerRequestProcessor:2:FollowerRequestProcessor@109] - 
FollowerRequestProcessor exited loop!
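For reference, the root cause in the stack above is an NPE inside LearnerHandler.packetToString (LearnerHandler.java:261) when the Byteman rule hands it a packet whose payload is null; the paramMap line shows "PING 200000001 null". A minimal standalone sketch of the null guard such a formatter would need — the QuorumPacket stub and its fields here are hypothetical stand-ins, not ZooKeeper's actual classes:

```java
// Standalone illustration only. The real formatter operates on
// org.apache.zookeeper.server.quorum.QuorumPacket; this stub is a
// hypothetical stand-in that just reproduces the null-payload case.
public class PacketFormatter {

    // Minimal stand-in for QuorumPacket (hypothetical fields).
    static final class QuorumPacket {
        final String type;  // e.g. "PING", "PROPOSAL"
        final long zxid;
        final byte[] data;  // null for PINGs, per the log above

        QuorumPacket(String type, long zxid, byte[] data) {
            this.type = type;
            this.zxid = zxid;
            this.data = data;
        }
    }

    // Null-safe formatter: never dereferences a field that may be absent.
    static String packetToString(QuorumPacket p) {
        if (p == null) {
            return "null packet";
        }
        String payload = (p.data == null) ? "null" : p.data.length + " bytes";
        return p.type + " " + Long.toHexString(p.zxid) + " " + payload;
    }

    public static void main(String[] args) {
        // A PING carries no payload; without the guard this would NPE.
        QuorumPacket ping = new QuorumPacket("PING", 0x200000001L, null);
        System.out.println(packetToString(ping)); // PING 200000001 null
    }
}
```

With a guard like this, the instrumentation could log PING packets without killing SyncThread.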


Then server 1, the leader, sends the incremental reconfig proposal, times out 
waiting for server 2's ack, and closes the connection:

2015-07-21 18:14:48,750 [myid:1] - INFO  [ProcessThread(sid:1 
cport:-1)::PrepRequestProcessor@512] - Incremental reconfig
Jul 21, 2015 6:14:48 PM net.osrg.earthquake.PBInspector EventFuncReturn
INFO: paramMap: {quorumPacket=PROPOSAL 200000002 null}
.....
2015-07-21 18:15:02,739 [myid:1] - WARN  
[QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):LearnerHandler@937]
 - Closing connection to peer due to transaction timeout.
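The ~14-second gap between the proposal (18:14:48) and the timeout (18:15:02) is consistent with the leader's sync timeout, which is typically derived from tickTime * syncLimit in zoo.cfg. The values below are hypothetical — the actual configuration for this run is not in the logs — and are shown only to make the arithmetic concrete:

```properties
# Hypothetical zoo.cfg fragment; actual values for this run are unknown.
tickTime=2000     # ms
syncLimit=7       # leader gives up on a learner after syncLimit * tickTime = 14 s
```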




> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-2172
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.0
>         Environment: Ubuntu 12.04 + java 7
>            Reporter: Ziyou Wang
>            Priority: Critical
>         Attachments: history.txt, node-1.log, node-2.log, node-3.log, 
> zoo-1.log, zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, 
> zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, 
> zoo-3.log, zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, zoo.cfg.dynamic.10000005d, 
> zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-1.out, zookeeper-2.log, 
> zookeeper-2.out, zookeeper-3.log, zookeeper-3.out
>
>
> The operations are quite simple: start three zk servers one by one, then 
> reconfig the cluster to add the new one as a participant. When I added the 
> third one, the zk cluster sometimes entered a weird state and could not recover.
>  
>       I found “2015-04-20 12:53:48,236 [myid:1] - INFO  [ProcessThread(sid:1 
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log, 
> so the first node received the reconfig cmd at 12:53:48. Later, it logged 
> “2015-04-20 12:53:52,230 [myid:1] - ERROR 
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception 
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] 
> - WARN  [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE 
>  /10.0.0.2:55890 ********”. From then on, the first and second nodes 
> rejected all client connections, and the third node never joined the cluster as 
> a participant. The whole cluster was effectively down.
>  
>      When the problem happened, all three nodes just used the same dynamic 
> config file zoo.cfg.dynamic.10000005d which only contained the first two 
> nodes. But there was another unused dynamic config file in node-1 directory 
> zoo.cfg.dynamic.next  which already contained three nodes.
>  
>      When I extended the wait between starting the third node and 
> reconfiguring the cluster, the problem did not occur again, so this appears 
> to be a race condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
