[
https://issues.apache.org/jira/browse/ZOOKEEPER-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945195#comment-17945195
]
Hany commented on ZOOKEEPER-2106:
---------------------------------
Hi there, has this issue been resolved? We have run into it too.
> Error when reading from leader causes JVM to hang
> -------------------------------------------------
>
> Key: ZOOKEEPER-2106
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2106
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.5
> Reporter: Robert Joseph Evans
> Priority: Critical
>
> I tried looking through existing JIRA issues for something like this, but the
> closest I came was ZOOKEEPER-2104. It looks very similar, but I don't know
> if it really is the same thing. Essentially, we had a 5-node ensemble for a
> large Storm cluster. A few of the nodes hit an error like the following at
> the same time:
> {code}
> WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@762] - Connection broken for id 2, my id = 4, error =
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
> WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
> WARN [SendWorker:2:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
>     at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
>     at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
> WARN [SendWorker:2:QuorumCnxManager$SendWorker@688] - Send worker leaving thread
> WARN [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@89] - Exception when following the leader
> java.net.SocketException: Connection reset
>     at java.net.SocketInputStream.read(SocketInputStream.java:189)
>     at java.net.SocketInputStream.read(SocketInputStream.java:121)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
> {code}
> After that, all of the connections are shut down:
> {code}
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client ...
> {code}
> but the JVM itself never manages to shut down:
> {code}
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerZooKeeperServer@139] - Shutting down
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:ZooKeeperServer@419] - shutting down
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerRequestProcessor@105] - Shutting down
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:CommitProcessor@181] - Shutting down
> INFO [FollowerRequestProcessor:4:FollowerRequestProcessor@95] - FollowerRequestProcessor exited loop!
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FinalRequestProcessor@415] - shutdown of request processor complete
> INFO [CommitProcessor:4:CommitProcessor@150] - CommitProcessor exited loop!
> WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client /... (no session established for client)
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:SyncRequestProcessor@175] - Shutting down
> INFO [SyncThread:4:SyncRequestProcessor@155] - SyncRequestProcessor exited!
> INFO [QuorumPeer[myid=4]/0.0.0.0:50512:QuorumPeer@670] - LOOKING
> {code}
> After that, all connections to that node are initiated and then shut down
> with "ZooKeeperServer not running". The node seems to stay in this state
> indefinitely until the process is manually restarted, after which it recovers.
> We have seen this happen on multiple servers at the same time, resulting in
> the entire ensemble being unusable.
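
Since the description above says the node stays in this state until it is manually restarted, an external probe may help spot it sooner. The following is only a minimal sketch, not part of the issue or of ZooKeeper itself: the class name `StuckNodeCheck`, the helper names, the 5 second timeout, and the default host and port are all hypothetical. It sends the `stat` four-letter command to the client port and treats a reply without a `Mode:` line, or one containing ZooKeeper's "not currently serving requests" text, as not serving. Whether a node stuck in this particular state answers that way is an assumption, not something the logs above confirm. On 3.5.3 and later the command may also need to be allowed via `4lw.commands.whitelist`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ZooKeeper.
class StuckNodeCheck {
    // Text ZooKeeper sends in reply to four-letter commands while it is not serving.
    static final String NOT_SERVING =
            "This ZooKeeper instance is not currently serving requests";

    /** Sends a four-letter command (for example "stat") and returns the raw reply. */
    static String fourLetterWord(String host, int port, String cmd, int timeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // a hung server surfaces as a read timeout
            OutputStream out = socket.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            // The server closes the connection after replying, so read until EOF.
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
        }
    }

    /** True when a "stat" reply looks like it came from a node that is serving. */
    static boolean looksServing(String reply) {
        return reply != null
                && !reply.contains(NOT_SERVING)
                && reply.contains("Mode:");
    }

    public static void main(String[] args) throws IOException {
        // Host and port are placeholders; pass the node's client port explicitly.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        String reply = fourLetterWord(host, port, "stat", 5000);
        System.out.println(looksServing(reply) ? "serving" : "NOT serving");
    }
}
```

A monitoring job could run this against each ensemble member and alert, or restart the process, when a node reports "NOT serving" for longer than a normal leader election would take.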