[jira] [Commented] (ZOOKEEPER-4220) Redundant connection attempts during leader election

Mate Szalay-Beko (Jira) Thu, 25 Feb 2021 04:49:33 -0800


    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290898#comment-17290898
 ]


Mate Szalay-Beko commented on ZOOKEEPER-4220:
---------------------------------------------

Hmm... I checked this code and I don't think this bug would cause an
increased number of connection attempts. It might cause the `connectOne`
method to be called more frequently, but in that method we always check
whether there is already an open connection for the given SID (and there is
more synchronization later in the connection initiation). But I am not 100%
sure... also, a lot has changed since 3.5.5.
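
As a rough illustration of that kind of guard, here is a minimal sketch. It is
not the actual QuorumCnxManager code: the class `ConnectGuardSketch`, the
`openConnections` map and the `attempts` counter are hypothetical names, and
the sketch assumes every attempt succeeds.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; names and structure are assumptions,
// not taken from ZooKeeper.
class ConnectGuardSketch {
    // Open connections keyed by server id (SID); the value stands in for a connection.
    private final ConcurrentHashMap<Long, String> openConnections = new ConcurrentHashMap<>();
    // Counts how many times a connection was actually initiated.
    final AtomicInteger attempts = new AtomicInteger();

    synchronized void connectOne(long sid, String electionAddr) {
        // Guard: an extra call is a no-op if a connection for this SID is already open.
        if (openConnections.containsKey(sid)) {
            return;
        }
        attempts.incrementAndGet();
        // Assumption for the sketch: the connection attempt always succeeds.
        openConnections.put(sid, electionAddr);
    }

    public static void main(String[] args) {
        ConnectGuardSketch manager = new ConnectGuardSketch();
        manager.connectOne(1L, "zk1.example.com:3888");
        manager.connectOne(1L, "zk1.example.com:3888"); // redundant call, skipped by the guard
        System.out.println(manager.attempts.get()); // prints 1
    }
}
```

With a guard like this, a redundant call only costs a map lookup while a
connection for the SID is open.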

Anyway, this is clearly a bug, so I'm going to fix this. I'm just not sure that 
it would be responsible for the high number of connection attempts you see in 
the logs.
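
For reference, the difference between the two comparisons can be shown with a
small standalone sketch. It assumes the addresses are
`java.net.InetSocketAddress` objects; the class and method names are
hypothetical and not part of ZooKeeper.

```java
import java.net.InetSocketAddress;

// Hypothetical demo class; it only mirrors the shape of the check quoted below.
class ElectionAddrCompareSketch {

    // Reference comparison, as in the reported check: true for any two distinct objects.
    static boolean differsByReference(InetSocketAddress proposed, InetSocketAddress committed) {
        return proposed != committed;
    }

    // Value comparison: true only if host or port actually differ.
    static boolean differsByValue(InetSocketAddress proposed, InetSocketAddress committed) {
        return !proposed.equals(committed);
    }

    public static void main(String[] args) {
        // Two distinct objects describing the same host and port.
        InetSocketAddress committed = InetSocketAddress.createUnresolved("zk1.example.com", 3888);
        InetSocketAddress proposed = InetSocketAddress.createUnresolved("zk1.example.com", 3888);

        System.out.println(differsByReference(proposed, committed)); // prints true
        System.out.println(differsByValue(proposed, committed));     // prints false
    }
}
```

With `!=`, the second branch is entered whenever the two views hold separate
address objects, even if they describe the same host and port.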

> Redundant connection attempts during leader election
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-4220
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4220
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.9, 3.6.2
>            Reporter: Alex Mirgorodskiy
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>             Fix For: 3.5.10, 3.6.3, 3.7.0
>
>
> We've seen a few failures or long delays in electing a new leader when the 
> previous one has a hard host reset (as opposed to just the service process 
> down, since connections don't need to wait for timeout there). Symptoms are 
> similar to https://issues.apache.org/jira/browse/ZOOKEEPER-2164. Reducing 
> cnxTimeout from 5 to 1.5 seconds makes the problem much less frequent, but 
> doesn't fix it completely. We are still using an old ZooKeeper version 
> (3.5.5), and the new async connect feature will presumably avoid it.
> But we noticed a pattern of twice the expected number of connection attempts 
> to the same downed instance in the log, and it appears to be due to a code 
> glitch in QuorumCnxManager.java:
>  
> {code:java}
> synchronized void connectOne(long sid) {
>     ...
>     if (lastCommittedView.containsKey(sid)) {
>         knownId = true;
>         if (connectOne(sid, lastCommittedView.get(sid).electionAddr))
>             return;
>     }
>     if (lastSeenQV != null && lastProposedView.containsKey(sid)
>             && (!knownId || (lastProposedView.get(sid).electionAddr !=   <----
>             lastCommittedView.get(sid).electionAddr))) {
>         knownId = true;
>         if (connectOne(sid, lastProposedView.get(sid).electionAddr))
>             return;
>     }
> {code}
> Comparing electionAddrs should be done with !equals presumably, otherwise 
> connectOne will be invoked an extra time even in the common case when the 
> addresses do match.
> The code around it has changed recently, but the check itself still exists at 
> the top of master. It might not matter as much with the async connects, but 
> perhaps it helps even then.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
