Ziyou Wang created ZOOKEEPER-2172:
-------------------------------------

             Summary: Cluster crashes when reconfig a new node as a participant
                 Key: ZOOKEEPER-2172
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
             Project: ZooKeeper
          Issue Type: Bug
          Components: leaderElection, quorum, server
    Affects Versions: 3.5.0
         Environment: Ubuntu 12.04 + java 7
            Reporter: Ziyou Wang
            Priority: Critical


The steps to reproduce are quite simple: start three zk servers one by one, then 
reconfig the cluster to add the new one as a participant. When I add the third 
one, the zk cluster may enter a weird state that it cannot recover from.
 
      I found “2015-04-20 12:53:48,236 [myid:1] - INFO  [ProcessThread(sid:1 
cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in node-1 log. So 
the first node received the reconfig cmd at 12:53:48. Later, it logged 
“2015-04-20  12:53:52,230 [myid:1] - ERROR 
[LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception 
causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] - 
WARN  [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE  
/10.0.0.2:55890 ********”. From then on, the first node and second node 
rejected all client connections and the third node didn’t join the cluster as a 
participant. The whole cluster was down.
 
     When the problem happened, all three nodes were using the same dynamic 
config file, zoo.cfg.dynamic.10000005d, which contained only the first two nodes. 
But there was another, unused dynamic config file in node-1's directory, 
zoo.cfg.dynamic.next, which already contained all three nodes.
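
     To spot this kind of mismatch on a node, the active file and the .next 
file can be compared. The following is only a minimal Python sketch, assuming 
the usual dynamic config line format (server.<id>=<host>:<port>:<port>:<role>;<clientport>). 
The function names, hosts and ports are made up for illustration and are not 
taken from my cluster.

```python
def parse_dynamic_config(text):
    """Return {server_id: spec} for every server.N=... line in a dynamic config."""
    servers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        # Anything that is not a server.N entry is ignored.
        if sep and key.startswith("server."):
            servers[int(key[len("server."):])] = value.strip()
    return servers


def pending_members(current_text, next_text):
    """Server ids present in the .next file but missing from the active file."""
    current = parse_dynamic_config(current_text)
    pending = parse_dynamic_config(next_text)
    return sorted(set(pending) - set(current))


# Hypothetical file contents; hosts and ports are illustrative only.
ACTIVE = """\
server.1=10.0.0.1:2888:3888:participant;2181
server.2=10.0.0.2:2888:3888:participant;2181
"""
NEXT = ACTIVE + "server.3=10.0.0.3:2888:3888:participant;2181\n"

if __name__ == "__main__":
    # Lists the ids that a pending reconfig would add.
    print(pending_members(ACTIVE, NEXT))
```

A non-empty result while the cluster is stuck would match the state described 
above: the new membership was written to the .next file but never activated.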
 
     When I extended the waiting time between starting the third node and 
reconfiguring the cluster, the problem did not show up again, so this looks 
like a race condition.
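
     Instead of a fixed sleep, a test script could poll the new server until 
it reports a mode before issuing the reconfig. This is only a sketch of a 
possible workaround and not a fix for the race itself: it assumes the "srvr" 
four-letter command is reachable on the client port and that its reply 
contains a "Mode:" line once the server is serving. The helper names are 
hypothetical.

```python
import socket
import time


def four_letter_word(host, port, cmd, timeout=2.0):
    """Send a four-letter command (e.g. 'srvr') and return the full reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")


def server_mode(reply):
    """Extract the value of the 'Mode:' line from a srvr reply, or None."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def wait_for_mode(host, port, deadline_s=30.0, interval_s=0.5):
    """Poll until the server reports a mode; return it, or None on timeout."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        try:
            mode = server_mode(four_letter_word(host, port, "srvr"))
            if mode:
                return mode
        except OSError:
            pass  # not accepting connections yet, or the request timed out
        time.sleep(interval_s)
    return None
```

The reconfig would then be sent only after wait_for_mode() returns a value for 
the third node. This narrows the window but cannot guarantee the race is 
avoided.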



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
