[
https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547023#comment-14547023
]
Michi Mutsuzaki commented on ZOOKEEPER-2172:
--------------------------------------------
node1 doesn't seem to receive the vote from itself. It receives votes from
node2 and node3:
{noformat}
node-1.log:2015-04-20 12:55:03,358 [myid:1] - INFO
[WorkerReceiver[myid=1]:FastLeaderElection@698] - Notification: 2 (message
format version), 2 (n.leader), 0x100000084 (n.zxid), 0x1 (n.round), LOOKING
(n.state), 2 (n.sid), 0x1 (n.peerEPoch), LEADING (my state)10000005d (n.config
version)
node-1.log:2015-04-20 12:55:51,547 [myid:1] - INFO
[WorkerReceiver[myid=1]:FastLeaderElection@698] - Notification: 2 (message
format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state),
3 (n.sid), 0x1 (n.peerEPoch), LEADING (my state)10000005d (n.config version)
{noformat}
node2 receives votes from node1 and itself:
{noformat}
node-2.log:2015-04-20 12:55:03,361 [myid:2] - INFO
[WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message
format version), 1 (n.leader), 0x0 (n.zxid), 0xffffffffffffffff (n.round),
LEADING (n.state), 1 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)10000005d
(n.config version)
node-2.log:2015-04-20 12:55:54,564 [myid:2] - INFO
[WorkerReceiver[myid=2]:FastLeaderElection@698] - Notification: 2 (message
format version), 2 (n.leader), 0x100000084 (n.zxid), 0x1 (n.round), LOOKING
(n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)10000005d (n.config
version)
{noformat}
Is node3's vote somehow confusing node1?
Yes, I think this cluster is using the patch from ZOOKEEPER-2031. Do you think
that might be related to this issue?
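To compare who received which vote across the attached logs, a small script can pull the fields out of each Notification line. This is only a sketch: the regex is inferred from the four log excerpts above rather than from the ZooKeeper source, and the helper names parse_notification and votes_received are hypothetical.

```python
import re

# Field layout inferred from the Notification lines quoted in this comment.
# This pattern is an assumption based on those samples, not on ZooKeeper code.
NOTIFICATION = re.compile(
    r"\[myid:(?P<myid>\d+)\].*?Notification: "
    r"(?P<version>\S+) \(message format version\), "
    r"(?P<leader>\S+) \(n\.leader\), "
    r"(?P<zxid>\S+) \(n\.zxid\), "
    r"(?P<round>\S+) \(n\.round\), "
    r"(?P<state>\S+) \(n\.state\), "
    r"(?P<sid>\S+) \(n\.sid\), "
    r"(?P<epoch>\S+) \(n\.peerEPoch\), "
    r"(?P<my_state>\S+) \(my state\)"
)


def parse_notification(record):
    """Return the fields of one Notification log record as a dict, or None."""
    # Long log lines may be wrapped (as in this mail), so collapse whitespace.
    flat = " ".join(record.split())
    match = NOTIFICATION.search(flat)
    return match.groupdict() if match else None


def votes_received(records):
    """Map each receiving myid to a list of (sender sid, proposed leader)."""
    received = {}
    for record in records:
        fields = parse_notification(record)
        if fields:
            received.setdefault(fields["myid"], []).append(
                (fields["sid"], fields["leader"])
            )
    return received


if __name__ == "__main__":
    sample = [
        "node-1.log:2015-04-20 12:55:03,358 [myid:1] - INFO "
        "[WorkerReceiver[myid=1]:FastLeaderElection@698] - Notification: 2 "
        "(message format version), 2 (n.leader), 0x100000084 (n.zxid), 0x1 "
        "(n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LEADING "
        "(my state)10000005d (n.config version)",
    ]
    print(votes_received(sample))
```

Run against the excerpts above, this groups the notifications by receiver: myid 1 sees votes from sid 2 and sid 3 but none from sid 1, while myid 2 sees votes from sid 1 and sid 2.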
> Cluster crashes when reconfig a new node as a participant
> ---------------------------------------------------------
>
> Key: ZOOKEEPER-2172
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.5.0
> Environment: Ubuntu 12.04 + java 7
> Reporter: Ziyou Wang
> Priority: Critical
> Attachments: node-1.log, node-2.log, node-3.log,
> zoo.cfg.dynamic.10000005d, zoo.cfg.dynamic.next
>
>
> The operations are quite simple: start three zk servers one by one, then
> reconfig the cluster to add the new one as a participant. When I add the
> third one, the zk cluster may enter a weird state and cannot recover.
>
> I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in node-1 log.
> So the first node received the reconfig cmd at 12:53:48. Later, it logged
> “2015-04-20 12:53:52,230 [myid:1] - ERROR
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1]
> - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - ******* GOODBYE
> /10.0.0.2:55890 ********”. From then on, the first node and second node
> rejected all client connections and the third node didn’t join the cluster as
> a participant. The whole cluster was down.
>
> When the problem happened, all three nodes just used the same dynamic
> config file zoo.cfg.dynamic.10000005d which only contained the first two
> nodes. But there was another unused dynamic config file in node-1 directory
> zoo.cfg.dynamic.next which already contained three nodes.
>
> When I extended the waiting time between starting the third node and
> reconfiguring the cluster, the problem didn’t show again. So it should be a
> race condition problem.