[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390957#comment-15390957 ]

Alexander Shraer commented on ZOOKEEPER-2172:
---------------------------------------------

Could you please provide the logs?

The existence of the .next file indicates that there was a failure in the middle 
of reconfig: the commitandactivate message didn't arrive at the server on which 
you found this file. Which server was it? Just server 4, or all of them?
The zoo.cfg.dynamic.100000000 file is the old configuration.
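For context, a minimal sketch of the file lifecycle described above (the server
addresses and the version suffix are hypothetical; the real suffix is the hex
zxid of the reconfig transaction):

```shell
# An incremental reconfig such as
#   zkCli.sh -server 10.0.0.1:2181 reconfig \
#     -add "server.4=10.0.0.4:2888:3888:participant;2181"
# first stages the proposed config as zoo.cfg.dynamic.next; only when the
# commitandactivate message arrives is it promoted to the active dynamic
# file. The two steps are simulated below with plain files:
dir=$(mktemp -d)
echo 'server.4=10.0.0.4:2888:3888:participant;2181' > "$dir/zoo.cfg.dynamic.next"
# commit-and-activate: promote the staged config (hypothetical version suffix)
mv "$dir/zoo.cfg.dynamic.next" "$dir/zoo.cfg.dynamic.100000001"
ls "$dir"
```

A server that crashes between the two steps is left with the stale .next file
you are seeing.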

Did the other servers continue to operate normally? Did they reboot? Were they 
able to serve requests afterwards? It would be helpful if you described this 
too.

7) is actually expected if server 4 crashed before it got the commitandactivate 
message. I described this in the manual:

"Finally, note that once connected to the leader, a joiner adopts the last 
committed configuration, in which it is absent (the initial config of the 
joiner is backed up before being rewritten). If the joiner restarts in this 
state, it will not be able to boot since it is absent from its configuration 
file. In order to start it you’ll once again have to specify an initial 
configuration."
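As a hedged illustration of that last point (addresses, ports, and the path are
hypothetical): to boot a joiner that is absent from its last committed config,
you recreate an initial dynamic config that includes it and point the joiner's
zoo.cfg at that file:

```shell
# Recreate an initial dynamic config for the joiner (server 4 here),
# since its last committed config does not list it:
cat > /tmp/zoo.cfg.dynamic.initial <<'EOF'
server.1=10.0.0.1:2888:3888:participant;2181
server.2=10.0.0.2:2888:3888:participant;2181
server.3=10.0.0.3:2888:3888:participant;2181
server.4=10.0.0.4:2888:3888:participant;2181
EOF
# Then reference it from the joiner's zoo.cfg before starting it:
#   dynamicConfigFile=/tmp/zoo.cfg.dynamic.initial
grep -c '^server\.' /tmp/zoo.cfg.dynamic.initial
```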

> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-2172
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.5.0
>         Environment: Ubuntu 12.04 + java 7
>            Reporter: Ziyou Wang
>            Assignee: Arshad Mohammad
>            Priority: Critical
>             Fix For: 3.5.3
>
>         Attachments: ZOOKEEPER-2172-02.patch, ZOOKEEPER-2172.patch, 
> history.txt, node-1.log, node-2.log, node-3.log, zoo-1.log, zoo-2-1.log, 
> zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, zoo-2212-2.log, 
> zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, zoo-3.log, 
> zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, zoo.cfg.dynamic.10000005d, 
> zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-1.out, zookeeper-2.log, 
> zookeeper-2.out, zookeeper-3.log, zookeeper-3.out
>
>
> The operations are quite simple: start three zk servers one by one, then 
> reconfig the cluster to add the new one as a participant. When I added the 
> third one, the zk cluster sometimes entered a weird state from which it could 
> not recover.
>  
>       I found “2015-04-20 12:53:48,236 [myid:1] - INFO  [ProcessThread(sid:1 
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 
> log, so the first node received the reconfig command at 12:53:48. Later, it 
> logged “2015-04-20 12:53:52,230 [myid:1] - ERROR 
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception 
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] 
> - WARN  [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE 
>  /10.0.0.2:55890 ********”. From then on, the first and second nodes rejected 
> all client connections, and the third node never joined the cluster as a 
> participant. The whole cluster was down.
>  
>      When the problem happened, all three nodes were using the same dynamic 
> config file, zoo.cfg.dynamic.10000005d, which only contained the first two 
> nodes. But there was another, unused dynamic config file in node-1's 
> directory, zoo.cfg.dynamic.next, which already contained all three nodes.
>  
>      When I extended the waiting time between starting the third node and 
> reconfiguring the cluster, the problem didn’t show up again, so this is most 
> likely a race condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
