Ziyou Wang created ZOOKEEPER-2172:
-------------------------------------
Summary: Cluster crashes when reconfig a new node as a participant
Key: ZOOKEEPER-2172
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection, quorum, server
Affects Versions: 3.5.0
Environment: Ubuntu 12.04 + java 7
Reporter: Ziyou Wang
Priority: Critical
The steps to reproduce are simple: start three zk servers one by one and, after
starting each new one, reconfig the cluster to add it as a participant. When I
add the third one, the zk cluster may enter a bad state from which it cannot recover.
I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1
cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log, so
the first node received the reconfig command at 12:53:48. Later, it logged
“2015-04-20 12:53:52,230 [myid:1] - ERROR
[LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception
causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] -
WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE
/10.0.0.2:55890 ********”. From then on, the first and second nodes
rejected all client connections, and the third node never joined the cluster as a
participant. The whole cluster was down.
When the problem happened, all three nodes were using the same dynamic
config file, zoo.cfg.dynamic.10000005d, which contained only the first two nodes.
However, the node-1 directory also held another, unused dynamic config file,
zoo.cfg.dynamic.next, which already contained all three nodes.
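To see exactly which membership change is still pending, the two dynamic config files can be compared. The following is a minimal sketch, not part of the original report: it assumes the files use the usual `server.N=...` line format, and the addresses, ports, and function names are made up for illustration.

```python
import re

# Matches dynamic-config lines such as
#   server.1=10.0.0.1:2888:3888:participant;2181
# Any other line (for example a version line) is ignored.
SERVER_LINE = re.compile(r"^server\.(\d+)=(.+)$")


def parse_members(text):
    """Return {server_id: spec} for every server.N line in a dynamic config."""
    members = {}
    for line in text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            members[int(match.group(1))] = match.group(2)
    return members


def pending_changes(current_text, next_text):
    """Return (added_ids, removed_ids) going from the current config to .next."""
    current = parse_members(current_text)
    proposed = parse_members(next_text)
    added = sorted(set(proposed) - set(current))
    removed = sorted(set(current) - set(proposed))
    return added, removed


# Hypothetical file contents mirroring the situation described above:
# the active config lists two servers, the .next file lists three.
current_cfg = (
    "server.1=10.0.0.1:2888:3888:participant;2181\n"
    "server.2=10.0.0.2:2888:3888:participant;2181\n"
)
next_cfg = current_cfg + "server.3=10.0.0.3:2888:3888:participant;2181\n"

print(pending_changes(current_cfg, next_cfg))  # ([3], [])
```

In practice the two strings would be read from zoo.cfg.dynamic.10000005d and zoo.cfg.dynamic.next in the node-1 directory; a non-empty result means the reconfig was proposed but never committed.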
When I extended the waiting time between starting the third node and
reconfiguring the cluster, the problem did not appear again, so this looks like
a race condition.
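As a stopgap while reproducing, the fixed wait can be replaced by polling the new server until it responds before the reconfig is issued. This is a rough sketch under assumptions, not a fix for the race itself: it relies on the `srvr` four-letter command being reachable on the client port and printing a `Mode:` line, and the helper names, host, and port are hypothetical.

```python
import socket
import time


def parse_mode(srvr_output):
    """Extract the value of the 'Mode:' line from srvr output, or None."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def srvr_mode(host, port, timeout=2.0):
    """Send the 'srvr' four-letter command; return the reported mode or None."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"srvr")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
    except OSError:
        # Connection refused, reset, or timed out: treat as "not ready".
        return None
    return parse_mode(b"".join(chunks).decode("utf-8", "replace"))


def wait_until(probe, timeout=30.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() repeatedly until it is truthy or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Hypothetical usage: wait for the third server before issuing the reconfig.
# ready = wait_until(lambda: srvr_mode("10.0.0.3", 2181) is not None)
```

The probe only checks that some mode is reported; which mode a server outside the current config reports is left open here.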