[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289791#comment-17289791
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-4220:
---------------------------------------------

Nice catch, [~mirg], thank you!!

This is also present on the current master branch: 
https://github.com/apache/zookeeper/blob/37eae03080b93d63a6ba9f624b37c764511ad2dc/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L771

Would you like to make a PR by changing this line?
If you don't have time just now, I'm happy to fix this quickly.

> Redundant connection attempts during leader election
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-4220
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4220
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.5
>            Reporter: Alex Mirgorodskiy
>            Priority: Major
>
> We've seen a few failures or long delays in electing a new leader when the 
> previous one has a hard host reset (as opposed to just the service process 
> down, since connections don't need to wait for timeout there). Symptoms are 
> similar to https://issues.apache.org/jira/browse/ZOOKEEPER-2164. Reducing 
> cnxTimeout from 5 to 1.5 seconds makes the problem much less frequent, but 
> doesn't fix it completely. We are still using an old ZooKeeper version 
> (3.5.5), and the new async connect feature will presumably avoid it.
> But we noticed a pattern of twice the expected number of connection attempts 
> to the same downed instance in the log, and it appears to be due to a code 
> glitch in QuorumCnxManager.java:
>  
> {code:java}
> synchronized void connectOne(long sid) {
>     ...
>     if (lastCommittedView.containsKey(sid)) {
>         knownId = true;
>         if (connectOne(sid, lastCommittedView.get(sid).electionAddr))
>             return;
>     }
>     if (lastSeenQV != null && lastProposedView.containsKey(sid)
>             && (!knownId || (lastProposedView.get(sid).electionAddr !=   <----
>             lastCommittedView.get(sid).electionAddr))) {
>         knownId = true;
>         if (connectOne(sid, lastProposedView.get(sid).electionAddr))
>             return;
>     }
> {code}
> Comparing electionAddrs should be done with !equals presumably, otherwise 
> connectOne will be invoked an extra time even in the common case when the 
> addresses do match.
> The code around it has changed recently, but the check itself still exists at 
> the top of master. It might not matter as much with the async connects, but 
> perhaps it helps even then.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to