Hello! as far as I can tell, the provided logs are not enough to determine the exact root cause of the problem. Maybe someone else will have a better idea, but my best guess would be that some internal listener thread in the leader (zoo3) died before, so it was not able to parse the leader election messages from zoo1 and/or zoo2. When you restarted the leader, then the listener threads re-initialized, so everything went back to normal.
There were a couple of issues like this reported already: - https://issues.apache.org/jira/browse/ZOOKEEPER-2938 - https://issues.apache.org/jira/browse/ZOOKEEPER-2186 - https://issues.apache.org/jira/browse/ZOOKEEPER-3016 ... ZooKeeper 3.4.14 should already contain the fix for these above. However, we just recently fixed a similar issue: https://issues.apache.org/jira/browse/ZOOKEEPER-3769 this fix will be part of 3.6.1 and 3.5.8. Of course, it is possible that you were hitting an independent / unknown issue... We would need all the logs to verify that (the logs from each ZooKeeper servers since their last restart before starting the rolling upgrade). Anyway, I strongly suggest to upgrade your ZooKeeper cluster, as the 3.4 will be EOL soon, see the announcement: https://mail-archives.apache.org/mod_mbox/zookeeper-user/202004.mbox/browser Kind regards, Mate On Wed, Apr 22, 2020 at 2:14 AM blb.dev <[email protected]> wrote: > Hi team, > > During a recent patching for our ZK quorum, we experienced an unrecoverable > outage. We have performed patches like this many times previously and is > working fine in other environments of ours. The goal was to shut down each > server one by one and provide patch updates then restart. However, this > time, when zoo1 (follower) was shut down, the leader (zoo3) shutdown > connection with remaining follower (zoo2) as well. > > What would cause the entire quorum to shutdown and not recover due to > stopping only zoo1? > > Running zookeeper 3.4.14 in docker containers. > zk_version 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on > 03/06/2019 16:18 GMT > > > *Config for master nodes:* > tickTime=2000 > maxClientCnxns=0 > dataDir=/data > dataLogDir=/datalog > clientPort=2181 > secureClientPort=2281 > initLimit=10 > syncLimit=5 > autopurge.snapRetainCount=10 > autopurge.purgeInterval=24 > > server.1=zoo1:2888:3888 > server.2=zoo2:2888:3888 > server.3=zoo3:2888:3888 > server.4=zoo4:2888:3888:observer > server.5=zoo5:2888:3888:observer > server.6=zoo6:2888:3888:observer > > *Config for observer nodes:* > tickTime=2000 > maxClientCnxns=0 > dataDir=/data > dataLogDir=/datalog > clientPort=2181 > secureClientPort=2281 > initLimit=10 > syncLimit=5 > autopurge.snapRetainCount=10 > autopurge.purgeInterval=24 > peerType=observer > server.1=zoo1:2888:3888 > server.2=zoo2:2888:3888 > server.3=zoo3:2888:3888 > server.4=zoo4:2888:3888:observer > server.5=zoo5:2888:3888:observer > server.6=zoo6:2888:3888:observer > > zoo1-zoo6 are the FQDNs of each server. > > Shutdown of zoo1 and quorum outage at 05:02 UTC > > > *Logs on zoo3 (leader):* > 2020-04-21 05:02:00,120 [myid:3] - WARN > [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Interrupting SendWorker > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010) > 2020-04-21 05:02:00,096 [myid:3] - WARN > [RecvWorker:1:QuorumCnxManager$RecvWorker@1025] - Connection broken for id > 1, my id = 3, error = > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010) > 2020-04-21 05:02:00,120 [myid:3] - WARN > [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Interrupting SendWorker > 2020-04-21 05:02:00,143 [myid:3] - WARN > [SendWorker:1:QuorumCnxManager$SendWorker@941] - Interrupted while waiting > for message on queue > java.lang.InterruptedException > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929) > > 2020-04-21 05:02:00,143 [myid:3] - WARN > [SendWorker:1:QuorumCnxManager$SendWorker@951] - Send worker leaving > thread > java.lang.InterruptedException > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929) > java.lang.InterruptedException > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929) > 2020-04-21 05:02:00,143 [myid:3] - WARN > [SendWorker:1:QuorumCnxManager$SendWorker@951] - Send worker leaving > thread > 2020-04-21 05:02:00,185 [myid:3] - ERROR [LearnerHandler-/<fqdn > zoo1>:40504:LearnerHandler@648] - Unexpected exception causing shutdown > while sock still open > java.net.SocketException: Connection reset > at java.net.SocketInputStream.read(SocketInputStream.java:210) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) > at > > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559) > 2020-04-21 05:02:00,185 [myid:3] - WARN [LearnerHandler-/<fqdn > zoo1>:40504:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo1>:40504 > ******** > 2020-04-21 05:02:00,350 [myid:3] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@222] - > Accepted socket connection from /<fqdn of docker host 2>:53238 > 2020-04-21 05:02:00,357 [myid:3] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@942] - Client > attempting to renew session 0x101b48709270416 at /<fqdn of docker host > 2>:53238 > 2020-04-21 05:02:00,357 [myid:3] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@694] - > Established session 0x101b48709270416 with negotiated timeout 18000 for > client /<fqdn of docker host 2>:53238 > > > 2020-04-21 05:02:12,456 [myid:3] - INFO > [SessionTracker:ZooKeeperServer@355] - Expiring session 0x601954721730004, > timeout of 10000ms exceeded > 2020-04-21 05:02:12,456 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x301b434e1e30003, likely client > has closed socket > 2020-04-21 05:02:12,456 [myid:3] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed > socket connection for client /<fdqn docker host 2>:44996 which had > sessionid > 0x301b434e1e30003 > 2020-04-21 05:02:12,457 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x301b434e1e31018, likely client > has closed socket > 2020-04-21 05:02:12,457 [myid:3] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed > socket connection for client /<fdqn docker host 9>:36858 which had > sessionid > 0x301b434e1e31018 > 2020-04-21 05:02:12,458 [myid:3] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@222] - > Accepted socket connection from /<fdqn primary kerberos>:35474 > 2020-04-21 05:02:12,459 [myid:3] - INFO [ProcessThread(sid:3 > cport:-1)::PrepRequestProcessor@487] - Processed session termination for > sessionid: 0x601954721730004 > 2020-04-21 05:02:12,459 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x301b434e1e3100c, likely client > has closed socket > 2020-04-21 05:02:12,459 [myid:3] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed > socket connection for client /<fdqn docker host 9>:41384 which had > sessionid > 0x301b434e1e3100c > 2020-04-21 05:02:12,466 [myid:3] - ERROR [LearnerHandler-/<fqdn > zoo5>:53914:LearnerHandler@648] - Unexpected exception causing shutdown > while sock still open > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) > at > > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559) > 2020-04-21 05:02:12,466 [myid:3] - WARN [LearnerHandler-/<fqdn > zoo5>:53914:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo5>:53914 > ******** > 2020-04-21 05:02:12,475 [myid:3] - ERROR [LearnerHandler-/<fqdn > zoo2>:48814:LearnerHandler@648] - Unexpected exception causing shutdown > while sock still open > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) > at > > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559) > 2020-04-21 05:02:12,475 [myid:3] - WARN [LearnerHandler-/<fqdn > zoo2>:48814:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo2>:48814 > ******** > 2020-04-21 05:02:12,501 [myid:3] - ERROR [LearnerHandler-/<fqdn > zoo4>:57706:LearnerHandler@648] - Unexpected exception causing shutdown > while sock still open > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > 2020-04-21 05:02:12,501 [myid:3] - WARN [LearnerHandler-/<fqdn > zoo4>:57706:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo4>:57706 > ******** > 2020-04-21 05:02:12,514 [myid:3] - INFO > [SessionTracker:ZooKeeperServer@355] - Expiring session 0x201b47640170003, > timeout of 10000ms exceeded > 2020-04-21 05:02:12,515 [myid:3] - INFO > [SessionTracker:ZooKeeperServer@355] - Expiring session 0x501957213710004, > timeout of 10000ms exceeded > 2020-04-21 05:02:12,515 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x301b434e1e3101e, likely client > has closed socket > 2020-04-21 05:02:12,523 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x301b434e1e30fd7, likely client > has closed socket > 2020-04-21 05:02:12,523 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x101786813042d94, likely client > has closed socket > 020-04-21 05:02:12,523 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x301b434e1e30036, likely client > has closed socket > 2020-04-21 05:02:12,524 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x3017815ebc525bc, likely client > has closed socket > 2020-04-21 05:02:12,524 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x1017868130426d1, likely client > has closed socket > 2020-04-21 05:02:12,525 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x301b434e1e30002, likely client > has closed socket > 2020-04-21 05:02:12,525 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to > read additional data from client sessionid 0x301b434e1e3142e, likely client > has closed socket > 2020-04-21 05:02:12,535 [myid:3] - ERROR [LearnerHandler-/<fqdn > zoo6>:48356:LearnerHandler@648] - Unexpected exception causing shutdown > while sock still open > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > 2020-04-21 05:02:12,538 [myid:3] - WARN [LearnerHandler-/<fqdn > zoo6>:48356:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo6>:48356 > ******** > > > 2020-04-21 05:03:08,536 [myid:3] - INFO > [SessionTracker:ZooKeeperServer@355] - Expiring session 0x401957e2e180411, > timeout of 40000ms exceeded > 2020-04-21 05:03:08,543 [myid:3] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@222] - > Accepted socket connection from /<fqdn docker host 3>:43804 > java.net.SocketException: Broken pipe (Write failed) > at java.net.SocketOutputStream.socketWrite0(Native Method) > at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111) > at java.net.SocketOutputStream.write(SocketOutputStream.java:155) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > at > > org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:215) > at > > org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:59) > at > > org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:517) > 2020-04-21 05:03:08,599 [myid:3] - INFO > [WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message > format version), 2 (n.leader), 0x3d000000e7 (n.zxid), 0xd (n.round), > LOOKING > (n.state), 2 (n.sid), 0x3d (n.peerEpoch) LEADING (my state) > 2020-04-21 05:03:08,557 [myid:3] - WARN [Sender-/<fqdn > zoo4>:56420:LearnerHandler@235] - Unexpected exception at LearnerHandler > Socket[addr=/<fqdn zoo4>,port=56420,localport=2888] tickOfNextAckDeadline:8 > synced?:true queuedPacketLength:0 > > > > > *Logs on zoo2:* > 2020-04-21 05:02:00,050 [myid:2] - WARN > [RecvWorker:1:QuorumCnxManager$RecvWorker@1025] - Connection broken for id > 1, my id = 2, error = > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010) > 2020-04-21 05:02:00,072 [myid:2] - WARN > [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Interrupting SendWorker > 2020-04-21 05:02:00,081 [myid:2] - WARN > [SendWorker:1:QuorumCnxManager$SendWorker@941] - Interrupted while waiting > for message on queue > java.lang.InterruptedException > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094) > at > > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74) > 2020-04-21 05:02:00,081 [myid:2] - WARN > [SendWorker:1:QuorumCnxManager$SendWorker@951] - Send worker leaving > thread > 2020-04-21 05:02:12,348 [myid:2] - WARN > [QuorumPeer[myid=2]/0.0.0.0:2181:Follower@90] - Exception when following > the > leader > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > 2020-04-21 05:02:12,349 [myid:2] - INFO > [QuorumPeer[myid=2]/0.0.0.0:2181:Follower@169] - shutdown called > java.lang.Exception: shutdown Follower > at > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:169) > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:986) > > > > Following the remediation I found here, > https://issues.apache.org/jira/browse/ZOOKEEPER-2938, restarting the > leader > node allowed the quorum to resume, but we would like to root cause the > initial connection loss. > > In addition to this happening during the patch, it happened a day ago on > its > own, where the connection loss errors began and a restart of the leader was > required for the quorum to recover -- what would be triggering this > connection broken state? > > > > > > -- > Sent from: http://zookeeper-user.578899.n2.nabble.com/ >
