[ https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602876#comment-14602876 ]
Filip Deleersnijder commented on ZOOKEEPER-2164:
------------------------------------------------

We experienced a related problem.

In a test setup with 6 servers (3.4.6), 2 of which were shut down, leader election could take a very long time (1 to 2 minutes) to complete. Once we changed the "cnxTO" variable in QuorumCnxManager from 5000 ms to 500 ms, it completed in under 10 seconds again.

In a setup with 8 servers (3.4.6), 2 of which were shut down, leader election could take a very long time (we have seen more than 10 minutes!) to complete, and it frequently started again immediately after completing. On Monday we will test our cnxTO fix on this setup as well.

> fast leader election keeps failing
> ----------------------------------
>
>                 Key: ZOOKEEPER-2164
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>            Reporter: Michi Mutsuzaki
>            Assignee: Hongchao Deng
>             Fix For: 3.5.2, 3.6.0
>
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader.
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't
> time out for 5 seconds:
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3:
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a
> while, so I'm guessing later versions have the same issue.
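
To make the effect of the connect timeout described above more concrete, here is a minimal Java sketch. It is not ZooKeeper code: the class `CnxTimeoutSketch` and both methods are hypothetical names introduced only for illustration. It assumes that connections to peers are attempted one after another and that each attempt to an unreachable peer blocks for the full timeout, which is a simplification of what the issue describes for connectOne().

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper class for illustration; not part of ZooKeeper.
class CnxTimeoutSketch {

    // Rough upper bound on how long one pass over the peers can block,
    // ASSUMING connects are tried serially and every connect to an
    // unreachable peer waits for the full timeout before giving up.
    static long worstCaseBlockMs(int unreachablePeers, int cnxTimeoutMs) {
        return (long) unreachablePeers * cnxTimeoutMs;
    }

    // Tries to open a TCP connection, giving up after timeoutMs.
    // Returns true only if the connection was established.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, unresolved hosts, and
            // SocketTimeoutException (a subclass of IOException).
            return false;
        }
    }
}
```

Under these assumptions, `worstCaseBlockMs(2, 5000)` gives 10000 ms per pass with two servers down, while `worstCaseBlockMs(2, 500)` gives 1000 ms. This model only shows the direction of the effect of lowering the timeout; it does not by itself account for the multi-minute election times reported above.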