Hi team,
During a recent patching for our ZK quorum, we experienced an unrecoverable
outage. We have performed patches like this many times previously and is
working fine in other environments of ours. The goal was to shut down each
server one by one and provide patch updates then restart. However, this
time, when zoo1 (follower) was shut down, the leader (zoo3) shutdown
connection with remaining follower (zoo2) as well.
What would cause the entire quorum to shutdown and not recover due to
stopping only zoo1?
Running zookeeper 3.4.14 in docker containers.
zk_version 3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on
03/06/2019 16:18 GMT
*Config for master nodes:*
tickTime=2000
maxClientCnxns=0
dataDir=/data
dataLogDir=/datalog
clientPort=2181
secureClientPort=2281
initLimit=10
syncLimit=5
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
server.4=zoo4:2888:3888:observer
server.5=zoo5:2888:3888:observer
server.6=zoo6:2888:3888:observer
*Config for observer nodes:*
tickTime=2000
maxClientCnxns=0
dataDir=/data
dataLogDir=/datalog
clientPort=2181
secureClientPort=2281
initLimit=10
syncLimit=5
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
peerType=observer
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
server.4=zoo4:2888:3888:observer
server.5=zoo5:2888:3888:observer
server.6=zoo6:2888:3888:observer
zoo1-zoo6 are the FQDNs of each server.
Shutdown of zoo1 and quorum outage at 05:02 UTC
*Logs on zoo3 (leader):*
2020-04-21 05:02:00,120 [myid:3] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Interrupting SendWorker
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010)
2020-04-21 05:02:00,096 [myid:3] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@1025] - Connection broken for id
1, my id = 3, error =
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010)
2020-04-21 05:02:00,120 [myid:3] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Interrupting SendWorker
2020-04-21 05:02:00,143 [myid:3] - WARN
[SendWorker:1:QuorumCnxManager$SendWorker@941] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929)
2020-04-21 05:02:00,143 [myid:3] - WARN
[SendWorker:1:QuorumCnxManager$SendWorker@951] - Send worker leaving thread
java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929)
java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929)
2020-04-21 05:02:00,143 [myid:3] - WARN
[SendWorker:1:QuorumCnxManager$SendWorker@951] - Send worker leaving thread
2020-04-21 05:02:00,185 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo1>:40504:LearnerHandler@648] - Unexpected exception causing shutdown
while sock still open
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-04-21 05:02:00,185 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo1>:40504:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo1>:40504
********
2020-04-21 05:02:00,350 [myid:3] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@222] -
Accepted socket connection from /<fqdn of docker host 2>:53238
2020-04-21 05:02:00,357 [myid:3] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@942] - Client
attempting to renew session 0x101b48709270416 at /<fqdn of docker host
2>:53238
2020-04-21 05:02:00,357 [myid:3] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@694] -
Established session 0x101b48709270416 with negotiated timeout 18000 for
client /<fqdn of docker host 2>:53238
2020-04-21 05:02:12,456 [myid:3] - INFO
[SessionTracker:ZooKeeperServer@355] - Expiring session 0x601954721730004,
timeout of 10000ms exceeded
2020-04-21 05:02:12,456 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x301b434e1e30003, likely client
has closed socket
2020-04-21 05:02:12,456 [myid:3] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed
socket connection for client /<fdqn docker host 2>:44996 which had sessionid
0x301b434e1e30003
2020-04-21 05:02:12,457 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x301b434e1e31018, likely client
has closed socket
2020-04-21 05:02:12,457 [myid:3] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed
socket connection for client /<fdqn docker host 9>:36858 which had sessionid
0x301b434e1e31018
2020-04-21 05:02:12,458 [myid:3] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@222] -
Accepted socket connection from /<fdqn primary kerberos>:35474
2020-04-21 05:02:12,459 [myid:3] - INFO [ProcessThread(sid:3
cport:-1)::PrepRequestProcessor@487] - Processed session termination for
sessionid: 0x601954721730004
2020-04-21 05:02:12,459 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x301b434e1e3100c, likely client
has closed socket
2020-04-21 05:02:12,459 [myid:3] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1056] - Closed
socket connection for client /<fdqn docker host 9>:41384 which had sessionid
0x301b434e1e3100c
2020-04-21 05:02:12,466 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo5>:53914:LearnerHandler@648] - Unexpected exception causing shutdown
while sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-04-21 05:02:12,466 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo5>:53914:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo5>:53914
********
2020-04-21 05:02:12,475 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo2>:48814:LearnerHandler@648] - Unexpected exception causing shutdown
while sock still open
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-04-21 05:02:12,475 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo2>:48814:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo2>:48814
********
2020-04-21 05:02:12,501 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo4>:57706:LearnerHandler@648] - Unexpected exception causing shutdown
while sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-04-21 05:02:12,501 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo4>:57706:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo4>:57706
********
2020-04-21 05:02:12,514 [myid:3] - INFO
[SessionTracker:ZooKeeperServer@355] - Expiring session 0x201b47640170003,
timeout of 10000ms exceeded
2020-04-21 05:02:12,515 [myid:3] - INFO
[SessionTracker:ZooKeeperServer@355] - Expiring session 0x501957213710004,
timeout of 10000ms exceeded
2020-04-21 05:02:12,515 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x301b434e1e3101e, likely client
has closed socket
2020-04-21 05:02:12,523 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x301b434e1e30fd7, likely client
has closed socket
2020-04-21 05:02:12,523 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x101786813042d94, likely client
has closed socket
020-04-21 05:02:12,523 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x301b434e1e30036, likely client
has closed socket
2020-04-21 05:02:12,524 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x3017815ebc525bc, likely client
has closed socket
2020-04-21 05:02:12,524 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x1017868130426d1, likely client
has closed socket
2020-04-21 05:02:12,525 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x301b434e1e30002, likely client
has closed socket
2020-04-21 05:02:12,525 [myid:3] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
read additional data from client sessionid 0x301b434e1e3142e, likely client
has closed socket
2020-04-21 05:02:12,535 [myid:3] - ERROR [LearnerHandler-/<fqdn
zoo6>:48356:LearnerHandler@648] - Unexpected exception causing shutdown
while sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-04-21 05:02:12,538 [myid:3] - WARN [LearnerHandler-/<fqdn
zoo6>:48356:LearnerHandler@661] - ******* GOODBYE /<fqdn zoo6>:48356
********
2020-04-21 05:03:08,536 [myid:3] - INFO
[SessionTracker:ZooKeeperServer@355] - Expiring session 0x401957e2e180411,
timeout of 40000ms exceeded
2020-04-21 05:03:08,543 [myid:3] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@222] -
Accepted socket connection from /<fqdn docker host 3>:43804
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at
org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:215)
at
org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:59)
at
org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:517)
2020-04-21 05:03:08,599 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@595] - Notification: 1 (message
format version), 2 (n.leader), 0x3d000000e7 (n.zxid), 0xd (n.round), LOOKING
(n.state), 2 (n.sid), 0x3d (n.peerEpoch) LEADING (my state)
2020-04-21 05:03:08,557 [myid:3] - WARN [Sender-/<fqdn
zoo4>:56420:LearnerHandler@235] - Unexpected exception at LearnerHandler
Socket[addr=/<fqdn zoo4>,port=56420,localport=2888] tickOfNextAckDeadline:8
synced?:true queuedPacketLength:0
*Logs on zoo2:*
2020-04-21 05:02:00,050 [myid:2] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@1025] - Connection broken for id
1, my id = 2, error =
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1010)
2020-04-21 05:02:00,072 [myid:2] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Interrupting SendWorker
2020-04-21 05:02:00,081 [myid:2] - WARN
[SendWorker:1:QuorumCnxManager$SendWorker@941] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
2020-04-21 05:02:00,081 [myid:2] - WARN
[SendWorker:1:QuorumCnxManager$SendWorker@951] - Send worker leaving thread
2020-04-21 05:02:12,348 [myid:2] - WARN
[QuorumPeer[myid=2]/0.0.0.0:2181:Follower@90] - Exception when following the
leader
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
2020-04-21 05:02:12,349 [myid:2] - INFO
[QuorumPeer[myid=2]/0.0.0.0:2181:Follower@169] - shutdown called
java.lang.Exception: shutdown Follower
at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:169)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:986)
Following the remediation I found here,
https://issues.apache.org/jira/browse/ZOOKEEPER-2938, restarting the leader
node allowed the quorum to resume, but we would like to root cause the
initial connection loss.
In addition to this happening during the patch, it happened a day ago on its
own, where the connection loss errors began and a restart of the leader was
required for the quorum to recover -- what would be triggering this
connection broken state?
--
Sent from: http://zookeeper-user.578899.n2.nabble.com/