Sebastian Schmitz created ZOOKEEPER-3822:
--------------------------------------------
Summary: Zookeeper 3.6.1 EndOfStreamException
Key: ZOOKEEPER-3822
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.6.1
Reporter: Sebastian Schmitz
Attachments: zookeeper.log
Hello,
Zookeeper 3.6.1 fixed the issue where the leader election contained the IP
address and therefore failed across separate networks, as in our Docker setup,
so I updated from 3.4.14 to 3.6.1 in the Dev and Test environments. The update
went smoothly and everything ran for one day. Last night the environment was
updated again. Because we deploy all containers (Kafka, Zookeeper, Mirrormaker
etc.) as one package, the Zookeeper containers are also replaced with the
latest ones. In this case nothing had changed; the containers were just removed
and deployed again. The config and data of Zookeeper are stored outside the
containers, so this is normally not a problem, but this time it broke the
Zookeeper clusters, and so Kafka was down as well.
* zookeeper_node_1 is stopped and the container removed and created again
* zookeeper_node_1 starts up and the election takes place
* zookeeper_node_2 is elected as leader again
* zookeeper_node_2 is stopped and the container removed and created again
* zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down
* zookeeper_node_2 starts up and zookeeper_node_3 remains leader
And from there all servers just report
2020-05-07 14:07:57,187 [myid:3] - WARN [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception
EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /z.z.z.z:46060, session = 0x2014386bbde0000
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
and don't recover.
I was able to recover the cluster in the Test environment by stopping and
starting all the Zookeeper nodes. The cluster in Dev is still in that state and
I'm checking the logs to find out more...
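One way to see which role each node believes it has while the cluster is stuck
is the "srvr" four-letter command on the client port. The following is only a
minimal sketch: the port 2181 and the helper names query_srvr and parse_mode
are assumptions for illustration, and it assumes "srvr" is allowed by
4lw.commands.whitelist on the servers.

```python
import socket


def parse_mode(srvr_output: str) -> str:
    """Return the value of the 'Mode:' line from srvr output, or 'unknown'."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    # A node that is not serving requests prints no 'Mode:' line.
    return "unknown"


def query_srvr(host: str, port: int = 2181, timeout: float = 5.0) -> str:
    """Send the 'srvr' four-letter command and return the raw reply.

    The port is an assumption; use the clientPort from your zoo.cfg.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling parse_mode(query_srvr(host)) for each of the three nodes should show
whether one of them reports "leader" and the others "follower", or whether
some of them are not serving requests at all.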
The full log of the deployment that started at 02:00 is attached. The first
timestamp is in local NZ time and the second one is UTC. I replaced the IPs
with x.x.x.x for node_1, y.y.y.y for node_2 and z.z.z.z for node_3.
The Kafka servers are running on the same machines, which means that the
EndOfStreamExceptions could also come from Kafka connections, since I don't
think that zookeeper_node_3 would establish a session with itself.
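To narrow down which clients are behind the warnings, the address and session
id can be pulled out of the log and counted. This is a rough sketch, not a
finished tool: the function names are made up for illustration, and
session_origin relies on my understanding that Zookeeper puts the id of the
server that created the session into the top byte of the 64-bit session id.

```python
import re
from collections import Counter

# Matches the "address = /ip:port, session = 0x..." part of the warning.
CLIENT_RE = re.compile(
    r"address = /(?P<ip>[^:\s]+):(?P<port>\d+), session =\s*(?P<session>0x[0-9a-fA-F]+)"
)


def count_clients(log_text: str) -> Counter:
    """Count EndOfStreamException warnings per (client ip, session id)."""
    # Collapse whitespace first so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return Counter((m["ip"], m["session"]) for m in CLIENT_RE.finditer(flat))


def session_origin(session_id: str) -> int:
    """Return the top byte of the session id (assumed to be the creating myid)."""
    return int(session_id, 16) >> 56
```

For the warning quoted above, count_clients gives one entry for z.z.z.z with
session 0x2014386bbde0000, and session_origin of that id is 2, which would
point to a session originally created on node_2 rather than on node_3.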
--
This message was sent by Atlassian Jira
(v8.3.4#803005)