[ https://issues.apache.org/jira/browse/ZOOKEEPER-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290898#comment-17290898 ]
Mate Szalay-Beko edited comment on ZOOKEEPER-4220 at 2/25/21, 12:55 PM:
------------------------------------------------------------------------

Hmm... checking this code, I don't think this bug would cause an increased number of connection attempts. It might lead to the `connectOne` method being called more frequently, but in that method we always check whether there is already an open connection for the given SID (and there is more synchronization later, when the connection is initiated). But I am not 100% sure... also, a lot has changed since 3.5.5.

Also, I think this part of the code should be reached only if dynamic reconfig is used. Are you using dynamic reconfig?

Anyway, this is clearly a bug, so I'm going to fix it. I'm just not sure that it is responsible for the high number of connection attempts you see in the logs.

was (Author: symat):
Hmm... checking this code, I don't think this bug would cause an increased number of connection attempts. It might lead to the `connectOne` method being called more frequently, but in that method we always check whether there is already an open connection for the given SID (and there is more synchronization later, when the connection is initiated). But I am not 100% sure... also, a lot has changed since 3.5.5.

Anyway, this is clearly a bug, so I'm going to fix it. I'm just not sure that it is responsible for the high number of connection attempts you see in the logs.

> Redundant connection attempts during leader election
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-4220
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4220
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.9, 3.6.2
>            Reporter: Alex Mirgorodskiy
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>             Fix For: 3.5.10, 3.6.3, 3.7.0
>
>
> We've seen a few failures or long delays in electing a new leader when the
> previous one has a hard host reset (as opposed to just the service process
> down, since connections don't need to wait for timeout there). Symptoms are
> similar to https://issues.apache.org/jira/browse/ZOOKEEPER-2164. Reducing
> cnxTimeout from 5 to 1.5 seconds makes the problem much less frequent, but
> doesn't fix it completely. We are still using an old ZooKeeper version
> (3.5.5), and the new async connect feature will presumably avoid it.
> But we noticed a pattern of twice the expected number of connection attempts
> to the same downed instance in the log, and it appears to be due to a code
> glitch in QuorumCnxManager.java:
>
> {code:java}
> synchronized void connectOne(long sid) {
>     ...
>     if (lastCommittedView.containsKey(sid)) {
>         knownId = true;
>         if (connectOne(sid, lastCommittedView.get(sid).electionAddr))
>             return;
>     }
>     if (lastSeenQV != null && lastProposedView.containsKey(sid)
>             && (!knownId || (lastProposedView.get(sid).electionAddr !=   <----
>                 lastCommittedView.get(sid).electionAddr))) {
>         knownId = true;
>         if (connectOne(sid, lastProposedView.get(sid).electionAddr))
>             return;
>     }
> {code}
> Comparing electionAddrs should be done with !equals presumably, otherwise
> connectOne will be invoked an extra time even in the common case when the
> addresses do match.
> The code around it has changed recently, but the check itself still exists at
> the top of master. It might not matter as much with the async connects, but
> perhaps it helps even then.
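
To illustrate the comparison issue described in the report, here is a minimal standalone sketch. It assumes the election address is a plain `java.net.InetSocketAddress`; the class name, the helper methods, and the address 10.0.0.1:3888 are made up for this example and are not taken from QuorumCnxManager. It only shows how `!=` and `!equals` behave for two separately constructed but identical addresses, and is not the actual fix.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Hypothetical demo class, not part of ZooKeeper.
class ElectionAddrComparison {

    // Builds a fresh address object from a literal IP, so no DNS lookup is involved.
    static InetSocketAddress addr(int port) {
        try {
            InetAddress ip = InetAddress.getByAddress(new byte[] {10, 0, 0, 1});
            return new InetSocketAddress(ip, port);
        } catch (UnknownHostException e) {
            // Only thrown for an illegal byte-array length, which cannot happen here.
            throw new IllegalStateException(e);
        }
    }

    // Same shape as the reported check: compares object references.
    static boolean differsByReference(InetSocketAddress committed, InetSocketAddress proposed) {
        return proposed != committed;
    }

    // Check suggested in the report: compares IP address and port by value.
    static boolean differsByValue(InetSocketAddress committed, InetSocketAddress proposed) {
        return !proposed.equals(committed);
    }

    public static void main(String[] args) {
        // Two distinct objects that describe the same endpoint.
        InetSocketAddress committed = addr(3888);
        InetSocketAddress proposed = addr(3888);

        System.out.println("reference check says different: " + differsByReference(committed, proposed)); // true
        System.out.println("value check says different:     " + differsByValue(committed, proposed));     // false
    }
}
```

With the reference check, the second branch of a condition shaped like the one quoted above would be taken even though both views hold the same address; with the value check it would be skipped. Whether that extra call results in an extra connection attempt depends on the checks inside the real `connectOne`, as discussed in the comment.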
-- This message was sent by Atlassian Jira (v8.3.4#803005)