[ https://issues.apache.org/jira/browse/ZOOKEEPER-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289791#comment-17289791 ]
Mate Szalay-Beko commented on ZOOKEEPER-4220: --------------------------------------------- Nice catch, [~mirg], thank you!! This is also present on the current master branch: https://github.com/apache/zookeeper/blob/37eae03080b93d63a6ba9f624b37c764511ad2dc/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L771 Would you like to make a PR by changing this line? If you don't have time just now, I'm happy to fix this quickly. > Redundant connection attempts during leader election > ---------------------------------------------------- > > Key: ZOOKEEPER-4220 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4220 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.5.5 > Reporter: Alex Mirgorodskiy > Priority: Major > > We've seen a few failures or long delays in electing a new leader when the > previous one has a hard host reset (as opposed to just the service process > down, since connections don't need to wait for timeout there). Symptoms are > similar to https://issues.apache.org/jira/browse/ZOOKEEPER-2164. Reducing > cnxTimeout from 5 to 1.5 seconds makes the problem much less frequent, but > doesn't fix it completely. We are still using an old ZooKeeper version > (3.5.5), and the new async connect feature will presumably avoid it. > But we noticed a pattern of twice the expected number of connection attempts > to the same downed instance in the log, and it appears to be due to a code > glitch in QuorumCnxManager.java: > > {code:java} > synchronized void connectOne(long sid) { > ... > if (lastCommittedView.containsKey(sid)) { > knownId = true; > if (connectOne(sid, lastCommittedView.get(sid).electionAddr)) > return; > } > if (lastSeenQV != null && lastProposedView.containsKey(sid) > && (!knownId || (lastProposedView.get(sid).electionAddr != <---- > lastCommittedView.get(sid).electionAddr))) { > knownId = true; > if (connectOne(sid, lastProposedView.get(sid).electionAddr)) > return; > } > {code} > Comparing electionAddrs should be done with !equals presumably, otherwise > connectOne will be invoked an extra time even in the common case when the > addresses do match. > The code around it has changed recently, but the check itself still exists at > the top of master. It might not matter as much with the async connects, but > perhaps it helps even then. -- This message was sent by Atlassian Jira (v8.3.4#803005)