[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206792#comment-17206792
 ] 

Stan Henderson commented on ZOOKEEPER-3940:
-------------------------------------------

[~maoling] I updated the Docker version on our Linux CentOS 7 VMs to 18.03.1-ce 
build 9ee9f40 after reading more of the Jira issues where people report problems 
when running ZK in Docker containers.

Things improved, but the cluster still does not end up in a desirable state. 

With zoo1 as the leader, I restart zoo1 and it never rejoins the quorum. 
mntr always returns *This ZooKeeper instance is not currently serving requests*.
Looking at the logs, what I'm seeing appears to be the same as what is 
reported here: https://issues.apache.org/jira/browse/ZOOKEEPER-2938
I attached zoo1-docker-containers.log; below is just a subset showing the 
*Have smaller server identifier, so dropping the connection* messages:

{code:java}
2020-10-03 16:03:48,420 [myid:1] - DEBUG 
[QuorumConnectionThread-[myid=1]-12:QuorumCnxManager@376] - Opening channel to 
server 3
2020-10-03 16:03:48,421 [myid:1] - DEBUG 
[QuorumConnectionThread-[myid=1]-14:QuorumCnxManager@376] - Opening channel to 
server 2
2020-10-03 16:03:48,422 [myid:1] - DEBUG 
[QuorumConnectionThread-[myid=1]-14:QuorumCnxManager@393] - Connected to server 
2 using election address: zoo2/9.48.164.34:3888
2020-10-03 16:03:48,422 [myid:1] - DEBUG 
[QuorumConnectionThread-[myid=1]-14:QuorumCnxManager@468] - startConnection 
(myId:1 --> sid:2)
2020-10-03 16:03:48,421 [myid:1] - DEBUG 
[QuorumConnectionThread-[myid=1]-14:QuorumCnxManager@376] - Opening channel to 
server 2
2020-10-03 16:03:48,422 [myid:1] - DEBUG 
[QuorumConnectionThread-[myid=1]-14:QuorumCnxManager@393] - Connected to server 
2 using election address: zoo2/9.48.164.34:3888
2020-10-03 16:03:48,422 [myid:1] - DEBUG 
[QuorumConnectionThread-[myid=1]-14:QuorumCnxManager@468] - startConnection 
(myId:1 --> sid:2)
2020-10-03 16:03:48,422 [myid:1] - INFO  
[QuorumConnectionThread-[myid=1]-14:QuorumCnxManager@513] - Have smaller server 
identifier, so dropping the connection: (myId:1 --> sid:2)
2020-10-03 16:03:48,422 [myid:1] - INFO  
[QuorumConnectionThread-[myid=1]-14:QuorumCnxManager@513] - Have smaller server 
identifier, so dropping the connection: (myId:1 --> sid:2)
{code}
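
For context on those messages: ZooKeeper keeps only the election connection 
initiated by the server with the larger id, so a restarted server 1 drops its own 
outgoing sockets to servers 2 and 3 and then has to wait for them to connect back; 
if those callback connections never arrive (e.g. because the peers still resolve 
the restarted container's old address, as discussed in the linked ZOOKEEPER-2938), 
election stalls with exactly the pattern above. A minimal sketch of that 
tie-breaking rule as I understand it (my own simplification, not the actual 
QuorumCnxManager code):

{code:java}
// Simplified model of the rule behind "Have smaller server identifier,
// so dropping the connection"; not the real QuorumCnxManager implementation.
public class ElectionTieBreakSketch {

    // A server keeps an election connection it initiated only if its own id
    // is larger than the remote server id.
    static boolean keepOutgoingConnection(long myId, long remoteSid) {
        return myId > remoteSid;
    }

    public static void main(String[] args) {
        // Server 1 dialing server 2: dropped, server 1 must wait for 2 to call back.
        System.out.println(keepOutgoingConnection(1, 2)); // false
        // Server 3 dialing server 2: kept, sender/receiver workers would start.
        System.out.println(keepOutgoingConnection(3, 2)); // true
    }
}
{code}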

The only method I have found to get zoo1, zoo2, and zoo3 all back into a good 
state is a rolling restart in the order zoo1, zoo2, zoo3, which I found some 
comments about in https://issues.apache.org/jira/browse/ZOOKEEPER-2164

When either zoo2 or zoo3 is the leader, restarting it does not seem to cause 
an issue and all 3 nodes return to a healthy state. Most often, zoo2 or zoo3 
is selected as the new leader.
I've attached the zoo2 and zoo3 logs for the working case to compare against 
the zoo1 log for the non-working case.

zoo.cfg

{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=60
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
snapshot.trust.empty=true
audit.enable=true
4lw.commands.whitelist=*
quorumListenOnAllIPs=true
reconfigEnabled=false
server.1=zoo1:2888:3888:participant;2181
server.2=zoo2:2888:3888:participant;2181
server.3=zoo3:2888:3888:participant;2181
{code}
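
As a side note, since 4lw.commands.whitelist=* is enabled in this zoo.cfg, the 
mntr status quoted above can be checked with a plain TCP client against the 
client port. A minimal sketch (my own hypothetical helper, assuming the plaintext 
client port 2181 from the config above, not the TLS port):

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: sends the "mntr" four-letter command to one server and
// prints the reply. An unhealthy node answers with
// "This ZooKeeper instance is not currently serving requests".
public class MntrCheck {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "zoo1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write("mntr".getBytes(StandardCharsets.US_ASCII));
            s.getOutputStream().flush();
            s.shutdownOutput();
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
{code}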

> Zookeeper restart of leader causes all zk nodes to not serve requests
> ---------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3940
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3940
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.2
>         Environment: dataDir=/data
> dataLogDir=/datalog
> tickTime=2000
> initLimit=10
> syncLimit=5
> maxClientCnxns=60
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> leaderServes=yes
> standaloneEnabled=false
> admin.enableServer=false
> snapshot.trust.empty=true
> audit.enable=true
> 4lw.commands.whitelist=*
> sslQuorum=true
> quorumListenOnAllIPs=true
> portUnification=false
> serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.quorum.keyStore.password=********
> ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.quorum.trustStore.password=********
> ssl.quorum.protocol=TLSv1.2
> ssl.quorum.enabledProtocols=TLSv1.2
> ssl.client.enable=true
> secureClientPort=2281
> client.portUnification=true
> clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
> ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.keyStore.password=********
> ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.trustStore.password=********
> ssl.protocol=TLSv1.2
> ssl.enabledProtocols=TLSv1.2
> reconfigEnabled=false
> server.1=zoo1:2888:3888:participant;2181
> server.2=zoo2:2888:3888:participant;2181
> server.3=zoo3:2888:3888:participant;2181
>            Reporter: Stan Henderson
>            Priority: Critical
>         Attachments: nossl-zoo.cfg, zk-docker-containers-nossl.log.zip, 
> zk-docker-containers.log.zip, zoo.cfg, zoo1-docker-containers.log, 
> zoo2-docker-containers.log, zoo3-docker-containers.log
>
>
> We have configured a 3-node ZooKeeper cluster using version 3.6.2 in a 
> Docker 1.12.1 containerized environment. This corresponds to Sep 16 
> 20:03:01 in the attached docker-containers.log files.
> NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for the 
> 3.6 branch
> As a part of our testing, we have restarted each of the ZooKeeper nodes and 
> have seen the following behaviour:
> zoo1, zoo2, and zoo3 healthy (zoo1 is leader)
> We started our testing at approximately Sep 17 13:01:05 in the attached 
> docker-containers.log files.
> 1. simulate patching zoo2
> - restart zoo2
> - zk_synced_followers 1
> - zoo1 leader
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - waited 5 minutes with no change
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 1
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - restart zoo2
> - no changes
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 2
> - zoo2 healthy
> - zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes and zoo3 returned to healthy
> 2. simulate patching zoo3
> - zoo1 leader
> - restart zoo3
> - zk_synced_followers 2
> - zoo1, zoo2, and zoo3 healthy
> 3. simulate patching zoo1
> - zoo1 leader
> - restart zoo1
> - zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently 
> serving requests)
> - waited 5 minutes to see if they resolve Sep 17 14:39 - Sep 17 14:44
> - tried restarting in this order: zoo2, zoo3, zoo1 and no change; all still 
> unhealthy (this step was not collected in the log files).
> The third case in the above scenarios is the critical one since we are no 
> longer able to start any of the zk nodes.
>  
> [~maoling] this issue may relate to 
> https://issues.apache.org/jira/browse/ZOOKEEPER-3920 which corresponds to the 
> first and second cases above. I am working on that issue with [~blb93].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
