[
https://issues.apache.org/jira/browse/ZOOKEEPER-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220575#comment-17220575
]
maoling commented on ZOOKEEPER-3940:
------------------------------------
_*---> I wonder if the IP address 172.17.0.2 is causing a similar error to the
one others were seeing with 0.0.0.0*_
_*Notice that in each zoo's /etc/hosts file there are two IP addresses for the
host. For example, zoo4 has 9.48.164.39, which is the IP address of the VM
where Docker is running, and 172.17.0.2, which is assigned by Docker. Docker
uses the default 172.17.0.0/16 subnet for container networking.*_
[~stanhend]
Great finding. There is a high probability that this is the cause. I also don't
understand why every /etc/hosts mapping is 172.17.0.2 (is it a configuration
mistake, or was it auto-generated by Docker)?
{code:java}
172.17.0.2 zoo4
172.17.0.2 zoo5
172.17.0.2 zoo6
{code}
My config looks like this:
{code:java}
172.18.0.2 zoo1
172.18.0.3 zoo2
172.18.0.4 zoo3
{code}
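For reference, here is a minimal docker-compose sketch of one way a mapping like
the 172.18.0.x one above can be produced: a user-defined bridge network with a
fixed subnet and a static address per container, instead of the default
172.17.0.0/16 bridge. The service names, image tag, and subnet are illustrative
assumptions and are not taken from this issue's actual setup; compose file
version 2 is used here only because it also works with older Docker engines.
{code:yaml}
# Sketch only: official zookeeper image, compose file v2, illustrative
# 172.18.0.0/16 subnet. Each container gets a fixed address on a user-defined
# bridge network rather than an address from the default 172.17.0.0/16 bridge.
version: "2"
services:
  zoo1:
    image: zookeeper:3.6.2
    hostname: zoo1
    environment:
      ZOO_MY_ID: "1"
      ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=zoo3:2888:3888;2181
    networks:
      zknet:
        ipv4_address: 172.18.0.2
  zoo2:
    image: zookeeper:3.6.2
    hostname: zoo2
    environment:
      ZOO_MY_ID: "2"
      ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=zoo3:2888:3888;2181
    networks:
      zknet:
        ipv4_address: 172.18.0.3
  zoo3:
    image: zookeeper:3.6.2
    hostname: zoo3
    environment:
      ZOO_MY_ID: "3"
      ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=zoo3:2888:3888;2181
    networks:
      zknet:
        ipv4_address: 172.18.0.4
networks:
  zknet:
    driver: bridge
    ipam:
      config:
        - subnet: 172.18.0.0/16
{code}
With fixed addresses like these, looking at /etc/hosts (or running getent hosts
zoo1) inside any container should consistently show the same 172.18.0.x address
across restarts, which makes it easier to rule the 172.17.0.2 mapping in or out
as the trigger for the election problem.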
> Zookeeper restart of leader causes all zk nodes to not serve requests
> ---------------------------------------------------------------------
>
> Key: ZOOKEEPER-3940
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3940
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.2
> Environment: dataDir=/data
> dataLogDir=/datalog
> tickTime=2000
> initLimit=10
> syncLimit=5
> maxClientCnxns=60
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> leaderServes=yes
> standaloneEnabled=false
> admin.enableServer=false
> snapshot.trust.empty=true
> audit.enable=true
> 4lw.commands.whitelist=*
> sslQuorum=true
> quorumListenOnAllIPs=true
> portUnification=false
> serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.quorum.keyStore.password=********
> ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.quorum.trustStore.password=********
> ssl.quorum.protocol=TLSv1.2
> ssl.quorum.enabledProtocols=TLSv1.2
> ssl.client.enable=true
> secureClientPort=2281
> client.portUnification=true
> clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
> ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.keyStore.password=********
> ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.trustStore.password=********
> ssl.protocol=TLSv1.2
> ssl.enabledProtocols=TLSv1.2
> reconfigEnabled=false
> server.1=zoo1:2888:3888:participant;2181
> server.2=zoo2:2888:3888:participant;2181
> server.3=zoo3:2888:3888:participant;2181
> Reporter: Stan Henderson
> Priority: Critical
> Attachments: nossl-zoo.cfg, zk-docker-containers-nossl.log.zip,
> zk-docker-containers.log.zip, zoo.cfg, zoo.cfg, zoo1-docker-containers.log,
> zoo1-docker-containers.log, zoo1-follower.log, zoo2-docker-containers.log,
> zoo2-leader.log, zoo3-docker-containers.log, zoo3-follower.log
>
>
> We have configured a 3-node ZooKeeper cluster using version 3.6.2 in a Docker
> 1.12.1 containerized environment. This corresponds to Sep 16 20:03:01 in the
> attached docker-containers.log files.
> NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for the
> 3.6 branch.
> As part of our testing, we restarted each of the ZooKeeper nodes and observed
> the following behaviour:
> zoo1, zoo2, and zoo3 healthy (zoo1 is leader)
> We started our testing at approximately Sep 17 13:01:05 in the attached
> docker-containers.log files.
> 1. simulate patching zoo2
> - restart zoo2
> - zk_synced_followers 1
> - zoo1 leader
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - waited 5 minutes with no change
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 1
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - restart zoo2
> - no changes
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 2
> - zoo2 healthy
> - zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes and zoo3 returned to healthy
> 2. simulate patching zoo3
> - zoo1 leader
> - restart zoo3
> - zk_synced_followers 2
> - zoo1, zoo2, and zoo3 healthy
> 3. simulate patching zoo1
> - zoo1 leader
> - restart zoo1
> - zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently
> serving requests)
> - waited 5 minutes (Sep 17 14:39 - Sep 17 14:44) to see if they would recover
> - tried restarting in this order: zoo2, zoo3, zoo1; no change, all still
> unhealthy (this step was not captured in the log files).
> The third case in the above scenarios is the critical one since we are no
> longer able to start any of the zk nodes.
>
> [~maoling] this issue may be related to
> https://issues.apache.org/jira/browse/ZOOKEEPER-3920, which corresponds to the
> first and second cases above and which I am working on with [~blb93].
--
This message was sent by Atlassian Jira
(v8.3.4#803005)