[
https://issues.apache.org/jira/browse/ZOOKEEPER-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215665#comment-17215665
]
Stan Henderson commented on ZOOKEEPER-3940:
-------------------------------------------
[~maoling]
I tried a different test today. I pulled the 3.6.2 zookeeper image, tagged it, and
pushed it to my docker repository without any modifications, then deployed it to my
3 Linux VMs.
I see the same issue: after stopping one of the zoo servers, it does not rejoin
until some other server is restarted.
The default zoo.cfg from the zookeeper:3.6.2 image:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=5
syncLimit=2
autopurge.snapRetainCount=3
autopurge.purgeInterval=0
maxClientCnxns=60
standaloneEnabled=true
admin.enableServer=true
server.4=zoo4:2888:3888
server.5=zoo5:2888:3888
server.6=zoo6:2888:3888
{code}
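For comparison, the config quoted in the original description below sets quorumListenOnAllIPs=true, while this default image config does not. A commonly suggested layout for multi-host Docker deployments (only a hedged sketch, not something verified against this reproduction) combines that option with binding the node's own server entry to 0.0.0.0, e.g. on the myid=4 container:
{code:java}
# Hypothetical zoo.cfg fragment for the container running with myid=4;
# the other nodes would mirror it, each binding its own entry to 0.0.0.0.
quorumListenOnAllIPs=true
server.4=0.0.0.0:2888:3888
server.5=zoo5:2888:3888
server.6=zoo6:2888:3888
{code}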
After restarting zoo4, it loops with '*Notification time out: 60000*':
{code:java}
Oct 16 16:28:24 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:28:24,138 [myid:4] - INFO [QuorumPeer[myid=4](plain=disabled)(secure=disabled):FastLeaderElection@979] - Notification time out: 60000
{code}
zoo5 and zoo6 both report '*configuration error, or a bug*':
{code:java}
Oct 16 16:29:24 zookeeperpoc5 docker[zookeeper_zoo5_1][6790]: 2020-10-16 21:29:24,172 [myid:5] - WARN [ListenerHandler-zoo5/172.17.0.2:3888:QuorumCnxManager@662] - We got a connection request from a server with our own ID. This should be either a configuration error, or a bug.
Oct 16 16:29:24 zookeeperpoc6 docker[zookeeper_zoo6_1][2985]: 2020-10-16 21:29:24,157 [myid:6] - WARN [ListenerHandler-zoo6/172.17.0.2:3888:QuorumCnxManager@662] - We got a connection request from a server with our own ID. This should be either a configuration error, or a bug.
{code}
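For context, that warning comes from the election-port listener in QuorumCnxManager: the connecting peer announces its server id, and the listener refuses the connection when that id equals its own. Roughly (a simplified paraphrase of my reading of the code, not the exact 3.6.2 source):
{code:java}
// Simplified paraphrase of the election-port handshake check in
// QuorumCnxManager (not a verbatim copy of the 3.6.2 source).
// 'sid' is the server id the remote peer announced after connecting.
if (sid == self.getId()) {
    // A peer claiming our own id normally points at a server-list or
    // DNS/IP resolution problem, hence the wording of the warning.
    LOG.warn("We got a connection request from a server with our own ID. "
        + "This should be either a configuration error, or a bug.");
}
// Otherwise the manager decides which side keeps the connection and
// starts sender/receiver workers for that sid.
{code}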
After restarting zoo6, zoo4 recovers and rejoins:
{code:java}
Oct 16 16:34:52 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:52,974 [myid:4] - INFO [ListenerHandler-zoo4/172.17.0.2:3888:QuorumCnxManager$Listener$ListenerHandler@1070] - Received connection request from /9.48.164.42:33134
Oct 16 16:34:52 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:52,974 [myid:4] - INFO [ListenerHandler-zoo4/172.17.0.2:3888:QuorumCnxManager$Listener$ListenerHandler@1070] - Received connection request from /9.48.164.42:33134
Oct 16 16:34:52 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:52,994 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:6, n.state:LOOKING, n.leader:6, n.round:0x1, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:52 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:52,994 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:6, n.state:LOOKING, n.leader:6, n.round:0x1, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,000 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:4, n.state:LOOKING, n.leader:6, n.round:0x1, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,000 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:4, n.state:LOOKING, n.leader:6, n.round:0x1, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,001 [myid:4] - INFO [QuorumConnectionThread-[myid=4]-16:QuorumCnxManager@513] - Have smaller server identifier, so dropping the connection: (myId:4 --> sid:5)
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,001 [myid:4] - INFO [QuorumConnectionThread-[myid=4]-16:QuorumCnxManager@513] - Have smaller server identifier, so dropping the connection: (myId:4 --> sid:5)
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,002 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:6, n.state:LOOKING, n.leader:6, n.round:0x2, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,002 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:6, n.state:LOOKING, n.leader:6, n.round:0x2, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,004 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:6, n.state:LOOKING, n.leader:6, n.round:0x2, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,004 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:6, n.state:LOOKING, n.leader:6, n.round:0x2, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,205 [myid:4] - INFO [QuorumPeer[myid=4](plain=disabled)(secure=disabled):QuorumPeer@857] - Peer state changed: following
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,205 [myid:4] - INFO [QuorumPeer[myid=4](plain=disabled)(secure=disabled):QuorumPeer@857] - Peer state changed: following
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,207 [myid:4] - INFO [QuorumPeer[myid=4](plain=disabled)(secure=disabled):QuorumPeer@1456] - FOLLOWING
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,207 [myid:4] - INFO [QuorumPeer[myid=4](plain=disabled)(secure=disabled):QuorumPeer@1456] - FOLLOWING
{code}
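The repeated '*Have smaller server identifier, so dropping the connection: (myId:4 --> sid:5)*' lines reflect the tie-break rule in the election transport: between any two peers, only the server with the larger id keeps an initiated connection. A condensed paraphrase (my reading of QuorumCnxManager, not the verbatim 3.6.2 source):
{code:java}
// Condensed paraphrase of the connection tie-break in QuorumCnxManager
// (not a verbatim copy of the 3.6.2 source). After the initiator has sent
// its own id to the remote peer identified by 'sid':
if (sid > self.getId()) {
    // We have the smaller id, so we drop our outbound attempt and rely on
    // the larger-id peer to connect back to us instead.
    LOG.info("Have smaller server identifier, so dropping the connection: (myId:{} --> sid:{})",
        self.getId(), sid);
    closeSocket(sock);
} else {
    // We have the larger id, so we keep this connection and start the
    // sender/receiver worker threads for that peer.
}
{code}
That would be consistent with zoo4 only recovering once zoo6, a higher-numbered peer, was restarted and re-initiated its connections, though I have not confirmed that this is the actual failure path.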
> Zookeeper restart of leader causes all zk nodes to not serve requests
> ---------------------------------------------------------------------
>
> Key: ZOOKEEPER-3940
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3940
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.2
> Environment: dataDir=/data
> dataLogDir=/datalog
> tickTime=2000
> initLimit=10
> syncLimit=5
> maxClientCnxns=60
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> leaderServes=yes
> standaloneEnabled=false
> admin.enableServer=false
> snapshot.trust.empty=true
> audit.enable=true
> 4lw.commands.whitelist=*
> sslQuorum=true
> quorumListenOnAllIPs=true
> portUnification=false
> serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.quorum.keyStore.password=********
> ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.quorum.trustStore.password=********
> ssl.quorum.protocol=TLSv1.2
> ssl.quorum.enabledProtocols=TLSv1.2
> ssl.client.enable=true
> secureClientPort=2281
> client.portUnification=true
> clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
> ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.keyStore.password=********
> ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.trustStore.password=********
> ssl.protocol=TLSv1.2
> ssl.enabledProtocols=TLSv1.2
> reconfigEnabled=false
> server.1=zoo1:2888:3888:participant;2181
> server.2=zoo2:2888:3888:participant;2181
> server.3=zoo3:2888:3888:participant;2181
> Reporter: Stan Henderson
> Priority: Critical
> Attachments: nossl-zoo.cfg, zk-docker-containers-nossl.log.zip,
> zk-docker-containers.log.zip, zoo.cfg, zoo.cfg, zoo1-docker-containers.log,
> zoo1-docker-containers.log, zoo1-follower.log, zoo2-docker-containers.log,
> zoo2-leader.log, zoo3-docker-containers.log, zoo3-follower.log
>
>
> We have configured a 3 node zookeeper cluster using the 3.6.2 version in a
> Docker version 1.12.1 containerized environment. This corresponds to Sep 16
> 20:03:01 in the attached docker-containers.log files.
> NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for the
> 3.6 branch.
> As a part of our testing, we have restarted each of the zookeeper nodes and
> have seen the following behaviour:
> zoo1, zoo2, and zoo3 healthy (zoo1 is leader)
> We started our testing at approximately Sep 17 13:01:05 in the attached
> docker-containers.log files.
> 1. simulate patching zoo2
> - restart zoo2
> - zk_synced_followers 1
> - zoo1 leader
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - waited 5 minutes with no change
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 1
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - restart zoo2
> - no changes
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 2
> - zoo2 healthy
> - zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes and zoo3 returned to healthy
> 2. simulate patching zoo3
> - zoo1 leader
> - restart zoo3
> - zk_synced_followers 2
> - zoo1, zoo2, and zoo3 healthy
> 3. simulate patching zoo1
> - zoo1 leader
> - restart zoo1
> - zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently
> serving requests)
> - waited 5 minutes (Sep 17 14:39 - Sep 17 14:44) to see if they would recover
> - tried restarting in this order: zoo2, zoo3, zoo1 and no change; all still
> unhealthy (this step was not collected in the log files).
> The third case in the above scenarios is the critical one since we are no
> longer able to start any of the zk nodes.
>
> [~maoling] this issue may relate to
> https://issues.apache.org/jira/browse/ZOOKEEPER-3920, which corresponds to the
> first and second cases above; I am working on that issue with [~blb93].
--
This message was sent by Atlassian Jira
(v8.3.4#803005)