Andre Price created ZOOKEEPER-3920:
--------------------------------------

             Summary: Zookeeper clients timeout after leader change
                 Key: ZOOKEEPER-3920
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3920
             Project: ZooKeeper
          Issue Type: Bug
          Components: quorum, server
    Affects Versions: 3.6.1
            Reporter: Andre Price
         Attachments: zk_repro.zip

[Sorry, I believe this is a dupe of
https://issues.apache.org/jira/browse/ZOOKEEPER-3828 and potentially
https://issues.apache.org/jira/browse/ZOOKEEPER-3466,

but I am not able to attach files there for some reason, so I am creating a new
issue which hopefully allows me to.]

We are encountering an issue where failing over from the leader results in
ZooKeeper clients not being able to connect successfully; they time out waiting
for a response from the server. We are attempting to upgrade some existing
ZooKeeper clusters from 3.4.14 to 3.6.1 (not sure if this is relevant, but
stating it in case it helps with pinpointing the issue), and the upgrade is
effectively blocked by this issue. We perform the rolling upgrade (followers
first, then the leader last) and it seems to go successfully by all indicators.
But we end up in the state described in this issue: if the leader changes
(either due to a restart or a stop), the cluster does not seem able to start
new sessions.
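
For illustration, here is a minimal sketch of the kind of connection attempt
that hangs after the failover, assuming the standard ZooKeeper Java client; the
connect string, hostnames, and timeout below are placeholders, not our actual
values:

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConnectCheck {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder connect string; ours lists the three ensemble members.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        try {
            // After the leader change, this latch never trips: the TCP connect
            // succeeds but session establishment gets no response.
            if (!connected.await(30, TimeUnit.SECONDS)) {
                System.err.println("Timed out waiting for session establishment");
            } else {
                System.out.println("Connected, session 0x"
                        + Long.toHexString(zk.getSessionId()));
            }
        } finally {
            zk.close();
        }
    }
}
{code}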

I've gathered some TRACE logs from our servers and will attach them in the
hope that they help figure this out.

Attached zk_repro.zip, which contains the following:
 * zoo.cfg used in one of the instances (they are all the same except that the
local server's IP is 0.0.0.0 in each; a hypothetical sketch follows this list)
 * zoo.cfg.dynamic.next (I don't think this is used anywhere, but it is written
by ZooKeeper at some point; based on its value, I think it is written when the
first 3.6.1 container becomes leader. The file is present in all containers and
is identical on all servers.)
 * s\{1,2,3}_zk.log - logs from each of the 3 servers. The estimated start of
the repro is indicated by the "// REPRO START" text and whitespace in the logs.
 * repro_steps.txt - the rough steps executed that produced the attached server
logs
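
For context, the ensemble config is roughly of the shape sketched below. This
is a hypothetical reconstruction with placeholder hosts and paths; the attached
zoo.cfg is authoritative:

{code}
# Hypothetical sketch only; see the attached zoo.cfg for the real values.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
# Dynamic reconfiguration is disabled in our setup.
reconfigEnabled=false
# On each server, the local entry uses 0.0.0.0 in place of its own address.
server.1=0.0.0.0:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
{code}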

 

I'll summarize the repro here also (a code sketch approximating steps 2 and 3
follows the list):
 # Initially it appears to be a healthy 3-node ensemble, all running 3.6.1.
Server ids are 1, 2, and 3, and 3 is the leader. Dynamic
config/reconfiguration is disabled.
 # Invoke srvr on each node (to verify the setup and also to create a bookmark
in the logs).
 # Do a zkCli get of /zookeeper/quota, which succeeds.
 # Restart the leader (to the same image/config); server 2 now becomes leader
and 3 comes back as a follower.
 # Try to perform the same zkCli get, which times out (this get is done from
within the container).
 # Try to perform the same zkCli get from another machine; this also times
out.
 # Invoke srvr on each node again (to verify that 2 is now the leader and to
bookmark the logs).
 # Restart server 2 (3 becomes leader, 2 a follower).
 # Do a zkCli get of /zookeeper/quota, which succeeds.
 # Invoke srvr on each node again (to verify that 3 is the leader).
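
As referenced above, steps 2 and 3 can be approximated with the stock Java
client API. This is a rough sketch with placeholder hostnames, not the exact
commands we ran:

{code:java}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.client.FourLetterWordMain;

public class ReproProbe {
    public static void main(String[] args) throws Exception {
        // Step 2: invoke srvr on each node to verify roles and bookmark the logs.
        for (String host : new String[] {"zk1", "zk2", "zk3"}) {
            System.out.println(host + ":\n"
                    + FourLetterWordMain.send4LetterWord(host, 2181, "srvr"));
        }

        // Step 3: the equivalent of a zkCli get of /zookeeper/quota. Before the
        // leader restart this returns; after it, the session is never
        // established and the request eventually times out.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });
        try {
            byte[] data = zk.getData("/zookeeper/quota", false, null);
            System.out.println("get succeeded: "
                    + (data == null ? 0 : data.length) + " bytes");
        } finally {
            zk.close();
        }
    }
}
{code}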

I tried to keep the other ZK traffic to a minimum, but there are likely some
periodic mntr requests mixed in from our metrics scraper.


