[
https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Arshad Mohammad updated ZOOKEEPER-2172:
---------------------------------------
Attachment: ZOOKEEPER-2172-02.patch
The issue occurs when a reconfig's PROPOSAL and COMMITANDACTIVATE arrive between
the snapshot and the UPTODATE packet, while syncing with the leader.
In the existing code, a follower does not process the reconfig commit in this
window as it should, whereas an observer processes it properly.
Processing the reconfig commit for a follower in the same way as it is already
processed for an observer fixes this issue.
Submitting the fix.
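
To illustrate the race described above, here is a minimal toy model. It is not
ZooKeeper's actual learner code: the class SyncSketch, its methods, and the
"server.N" config strings are hypothetical and only mirror the packet names
mentioned in this comment. It assumes a reconfig PROPOSAL and its
COMMITANDACTIVATE arrive after the snapshot and before UPTODATE, and compares
a learner that skips the reconfig commit with one that applies it.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: these types are hypothetical, not ZooKeeper classes.
public class SyncSketch {
    enum PacketType { SNAP, PROPOSAL, COMMITANDACTIVATE, UPTODATE }

    static final class Packet {
        final PacketType type;
        final String config; // non-null only for a reconfig PROPOSAL in this model

        Packet(PacketType type, String config) {
            this.type = type;
            this.config = config;
        }
    }

    // Packets a learner might see when a reconfig lands between the
    // snapshot and UPTODATE (the window described in the comment above).
    static List<Packet> racySyncStream() {
        return Arrays.asList(
            new Packet(PacketType.SNAP, null),
            new Packet(PacketType.PROPOSAL, "server.1,server.2,server.3"),
            new Packet(PacketType.COMMITANDACTIVATE, null),
            new Packet(PacketType.UPTODATE, null));
    }

    // Returns the config the learner considers active once sync ends.
    // applyReconfigCommit=false models skipping the commit during sync;
    // applyReconfigCommit=true models applying it.
    static String configAfterSync(List<Packet> packets, boolean applyReconfigCommit) {
        String active = "server.1,server.2"; // config before the reconfig
        String pending = null;               // proposed but not yet committed
        for (Packet p : packets) {
            switch (p.type) {
                case PROPOSAL:
                    pending = p.config;
                    break;
                case COMMITANDACTIVATE:
                    if (applyReconfigCommit && pending != null) {
                        active = pending;
                        pending = null;
                    }
                    break;
                case UPTODATE:
                    return active; // sync finished
                default:
                    break; // SNAP carries no config in this model
            }
        }
        return active;
    }

    public static void main(String[] args) {
        System.out.println("commit skipped: " + configAfterSync(racySyncStream(), false));
        System.out.println("commit applied: " + configAfterSync(racySyncStream(), true));
    }
}
```

In this model, the learner that skips the commit finishes sync still holding
the two-server config, which resembles the stale dynamic config file reported
in the issue below. The learner that applies the commit ends with all three
servers.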
> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
> Key: ZOOKEEPER-2172
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.0
> Environment: Ubuntu 12.04 + java 7
> Reporter: Ziyou Wang
> Assignee: Arshad Mohammad
> Priority: Critical
> Fix For: 3.5.3
>
> Attachments: ZOOKEEPER-2172-02.patch, ZOOKEEPER-2172.patch,
> history.txt, node-1.log, node-2.log, node-3.log, zoo-1.log, zoo-2-1.log,
> zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, zoo-2212-2.log,
> zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, zoo-3.log,
> zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, zoo.cfg.dynamic.10000005d,
> zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-1.out, zookeeper-2.log,
> zookeeper-2.out, zookeeper-3.log, zookeeper-3.out
>
>
> The operations are quite simple: start three zk servers one by one, then
> reconfig the cluster to add the new one as a participant. When I add the
> third one, the zk cluster may enter a weird state from which it cannot recover.
>
> I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in node-1 log.
> So the first node received the reconfig cmd at 12:53:48. Later, it logged
> “2015-04-20 12:53:52,230 [myid:1] - ERROR
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1]
> - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE
> /10.0.0.2:55890 ********”. From then on, the first node and second node
> rejected all client connections and the third node didn’t join the cluster as
> a participant. The whole cluster was down.
>
> When the problem happened, all three nodes were using the same dynamic
> config file, zoo.cfg.dynamic.10000005d, which contained only the first two
> nodes. But there was another, unused dynamic config file in the node-1
> directory, zoo.cfg.dynamic.next, which already contained all three nodes.
>
> When I extended the waiting time between starting the third node and
> reconfiguring the cluster, the problem did not recur, so this looks like a
> race condition.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)