[ https://issues.apache.org/jira/browse/ZOOKEEPER-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mate Szalay-Beko reassigned ZOOKEEPER-4220: ------------------------------------------- Assignee: Mate Szalay-Beko > Redundant connection attempts during leader election > ---------------------------------------------------- > > Key: ZOOKEEPER-4220 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4220 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.5.5 > Reporter: Alex Mirgorodskiy > Assignee: Mate Szalay-Beko > Priority: Major > > We've seen a few failures or long delays in electing a new leader when the > previous one has a hard host reset (as opposed to just the service process > down, since connections don't need to wait for timeout there). Symptoms are > similar to https://issues.apache.org/jira/browse/ZOOKEEPER-2164. Reducing > cnxTimeout from 5 to 1.5 seconds makes the problem much less frequent, but > doesn't fix it completely. We are still using an old ZooKeeper version > (3.5.5), and the new async connect feature will presumably avoid it. > But we noticed a pattern of twice the expected number of connection attempts > to the same downed instance in the log, and it appears to be due to a code > glitch in QuorumCnxManager.java: > > {code:java} > synchronized void connectOne(long sid) { > ... > if (lastCommittedView.containsKey(sid)) { > knownId = true; > if (connectOne(sid, lastCommittedView.get(sid).electionAddr)) > return; > } > if (lastSeenQV != null && lastProposedView.containsKey(sid) > && (!knownId || (lastProposedView.get(sid).electionAddr != <---- > lastCommittedView.get(sid).electionAddr))) { > knownId = true; > if (connectOne(sid, lastProposedView.get(sid).electionAddr)) > return; > } > {code} > Comparing electionAddrs should be done with !equals presumably, otherwise > connectOne will be invoked an extra time even in the common case when the > addresses do match. > The code around it has changed recently, but the check itself still exists at > the top of master. It might not matter as much with the async connects, but > perhaps it helps even then. -- This message was sent by Atlassian Jira (v8.3.4#803005)