[
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205360#comment-16205360
]
Laurie Turner commented on ZOOKEEPER-2164:
------------------------------------------
I believe I have run into this issue (ZooKeeper versions 3.4.6 and 3.4.10).
The scenarios I've tested lead me to believe I have the same problem. I
have a 3-node cluster, and if the leader is "2" and is stopped, the election
fails and ultimately 1 and 3 respond to the stat command with "This ZooKeeper
instance is not currently serving requests".
If 2 is restarted, the cluster recovers and 2 becomes the leader again. This
appears to be the scenario documented above. Sometimes 3 fails to rejoin,
but if it is restarted it rejoins the cluster successfully.
Essentially the only electable leader is #2. The nodes are built as Docker
containers and orchestrated using Kubernetes.
I am searching for a workaround or configuration change that will keep the
cluster functional if the existing leader fails and only 2 nodes (out of 3)
are available.
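
For anyone trying to reproduce this, the stat check described in the comment
can be scripted. The following is a minimal sketch, not part of the original
report: it sends the four-letter command "stat" to a server's client port and
looks for the "not currently serving requests" message quoted above. The host
names zk-1, zk-2, zk-3 and the port 2181 are assumptions for illustration;
substitute your own.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class StatCheck {
    // Message quoted in the comment; a node answers with it when it has no quorum.
    static final String NOT_SERVING =
            "This ZooKeeper instance is not currently serving requests";

    /** Sends a four-letter command (e.g. "stat") and returns the raw reply. */
    static String fourLetterWord(String host, int port, String cmd, int timeoutMs)
            throws IOException {
        try (Socket sock = new Socket()) {
            sock.connect(new InetSocketAddress(host, port), timeoutMs);
            sock.setSoTimeout(timeoutMs);
            OutputStream out = sock.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            sock.shutdownOutput();
            // The server closes the connection after replying, so read to EOF.
            InputStream in = sock.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    /** True if the reply is non-empty and is not the "not serving" message. */
    static boolean isServing(String statReply) {
        return !statReply.isEmpty() && !statReply.contains(NOT_SERVING);
    }

    public static void main(String[] args) {
        // Hypothetical host names; pass real ones as arguments.
        String[] hosts = args.length > 0 ? args : new String[] {"zk-1", "zk-2", "zk-3"};
        for (String host : hosts) {
            try {
                String reply = fourLetterWord(host, 2181, "stat", 3000);
                System.out.println(host + ": " + (isServing(reply) ? "serving" : "NOT serving"));
            } catch (IOException e) {
                System.out.println(host + ": unreachable (" + e.getMessage() + ")");
            }
        }
    }
}
```

Running this in a loop while stopping the leader shows whether the two
remaining nodes ever return to a serving state.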
> fast leader election keeps failing
> ----------------------------------
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection
> Affects Versions: 3.4.5
> Reporter: Michi Mutsuzaki
> Fix For: 3.5.4, 3.6.0
>
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader.
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't
> time out for 5 seconds:
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3:
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a
> while, so I'm guessing later versions have the same issue.
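
The timing race described in the quoted report can be illustrated with a small
model. This is a simplified sketch, not ZooKeeper code: the 5-second blocking
window comes from the report, while the follower's retry budget (5 attempts,
1 second apart, each refused immediately) is an assumption about the linked
Learner code that should be checked against the source. The class and method
names are hypothetical.

```java
final class ElectionRaceSketch {
    /**
     * Time (ms) of the follower's last connection attempt, assuming every
     * attempt is refused immediately and the follower sleeps retrySleepMs
     * between attempts. The first attempt happens at t = 0.
     */
    static long lastAttemptAtMs(int tries, long retrySleepMs) {
        return (long) (tries - 1) * retrySleepMs;
    }

    /**
     * True if the follower still has an attempt left at or after the moment
     * the would-be leader stops blocking and starts accepting followers.
     */
    static boolean followerConnects(long leaderBlockedMs, int tries, long retrySleepMs) {
        return lastAttemptAtMs(tries, retrySleepMs) >= leaderBlockedMs;
    }

    public static void main(String[] args) {
        long leaderBlockedMs = 5000; // from the report: connectOne() to 2 blocks for 5 seconds
        int tries = 5;               // assumption: follower retry count
        long retrySleepMs = 1000;    // assumption: sleep between follower retries

        System.out.println("follower's last attempt at (ms): "
                + lastAttemptAtMs(tries, retrySleepMs));
        System.out.println("follower connects: "
                + followerConnects(leaderBlockedMs, tries, retrySleepMs));
    }
}
```

Under these assumptions the follower's last attempt comes before the would-be
leader is ready, so 1 gives up and both nodes go back to election. In this
model the follower connects only when the leader's blocking window is shorter
than the follower's retry window.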