[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko updated ZOOKEEPER-4220:
----------------------------------------
    Fix Version/s: 3.7.0
                   3.6.3
                   3.5.10

> Redundant connection attempts during leader election
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-4220
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4220
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.8, 3.6.2
>            Reporter: Alex Mirgorodskiy
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>             Fix For: 3.5.10, 3.6.3, 3.7.0
>
>
> We've seen a few failures or long delays in electing a new leader when the 
> previous leader's host goes through a hard reset. This is not the case when 
> only the service process goes down, since connection attempts do not have to 
> wait for a timeout there. Symptoms are similar to 
> https://issues.apache.org/jira/browse/ZOOKEEPER-2164. Reducing cnxTimeout 
> from 5 to 1.5 seconds makes the problem much less frequent, but doesn't fix 
> it completely. We are still using an old ZooKeeper version (3.5.5), and the 
> new async connect feature will presumably avoid the problem.
> However, the log shows twice the expected number of connection attempts to 
> the same downed instance, which appears to be due to a code glitch in 
> QuorumCnxManager.java:
>  
> {code:java}
> synchronized void connectOne(long sid) {
>     ...
>     if (lastCommittedView.containsKey(sid)) {
>         knownId = true;
>         if (connectOne(sid, lastCommittedView.get(sid).electionAddr))
>             return;
>     }
>     if (lastSeenQV != null && lastProposedView.containsKey(sid)
>             && (!knownId || (lastProposedView.get(sid).electionAddr !=   // <----
>             lastCommittedView.get(sid).electionAddr))) {
>         knownId = true;
>         if (connectOne(sid, lastProposedView.get(sid).electionAddr))
>             return;
>     }
> {code}
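> The electionAddr values should presumably be compared with !equals() rather 
> than !=. The != operator compares object references, so it is true for two 
> distinct objects even when the addresses they hold are equal. As a result, 
> when the first connectOne attempt fails, connectOne is invoked a second time 
> even in the common case where the committed and proposed addresses match.
> The code around it has changed recently, but the check itself still exists at 
> the top of master. It might matter less with async connects, but the fix may 
> help even then.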



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
