We are running a 5 node Zookeeper cluster on version 3.8.4 (it had the same issue with 3.7.1). We run a load test (just a few connections) that generates about 300,000 ephemeral nodes. We drop these connections simultaneously so those ephemeral nodes expire around the same time. At this point we see a crash in LearnerHandler:
Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:12,764 [myid:] - INFO [Snapshot Thread:o.a.z.s.p.FileTxnSnapLog@479] - Snapshotting: 0x100093f83 to /var/lib/zookeeper/version-2/ snapshot.100093f83 Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:12,765 [myid:] - INFO [SyncThread:5:o.a.z.s.p.FileTxnLog@285] - Creating new log file: log.100093f85 Jun 10 17:39:13 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:13,520 [myid:] - INFO [Snapshot Thread:o.a.z.s.ZooKeeperServer@558] - Snapshot taken in 756 ms Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:16,903 [myid:] - ERROR [LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@719] - Unexpected exception in LearnerHandler: Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: java.io.EOFException: null Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at java.io.DataInputStream.readInt(DataInputStream.java:392) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:16,903 [myid:] - ERROR [Sender-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@372] - Exception while sending packets in LearnerHandler Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: java.net.SocketException: Connection reset Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at java.net.SocketOutputStream.write(SocketOutputStream.java:155) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at java.io.DataOutputStream.write(DataOutputStream.java:107) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at java.io.FilterOutputStream.write(FilterOutputStream.java:97) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65) Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751) Then connections to learners drop: Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:16,903 [myid:] - INFO [LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@1159] - Synchronously closing socket to learner 2. Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:16,903 [myid:] - INFO [LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@1159] - Synchronously closing socket to learner 3. Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:16,903 [myid:] - INFO [LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@1159] - Synchronously closing socket to learner 4. Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:16,903 [myid:] - WARN [LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@735] - ******* GOODBYE /10.1.67.28:51888 ******** Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:16,904 [myid:] - WARN [LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@735] - ******* GOODBYE /10.1.71.8:40626 ******** Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:16,904 [myid:] - WARN [LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@735] - ******* GOODBYE /10.1.79.102:35670 ******** Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:16,904 [myid:] - WARN [LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@735] - ******* GOODBYE /10.1.72.1:54384 ******** Then it seems like it can't load (or update) the database? Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: 2024-06-10 17:39:18,146 [myid:] - ERROR [QuorumPeer[myid=5](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.ZooKeeperServer@892] - Error updating DB Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: java.io.IOException: Unreasonable length = 8010890 Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:750) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:632) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:448) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:435) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:330) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.ZKDatabase.fastForwardDataBase(ZKDatabase.java:300) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:890) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:838) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.shutdown(LeaderZooKeeperServer.java:162) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:837) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:803) Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552) After this, zookeeper-4 appears to be elected leader and then hits the same LearnerHandler crash: Jun 10 17:39:17 zookeeper-4-automation99999 zookeeper-server[19461]: 2024-06-10 17:39:17,208 [myid:] - INFO [QuorumPeer[myid=4](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.q.QuorumPeer@1549] - LEADING Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 2024-06-10 17:39:24,125 [myid:] - ERROR [LearnerHandler-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@719] - Unexpected exception in LearnerHandler: Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: java.io.EOFException: null Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at java.io.DataInputStream.readInt(DataInputStream.java:392) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 2024-06-10 17:39:24,125 [myid:] - ERROR [Sender-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@372] - Exception while sending packets in LearnerHandler Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: java.net.SocketException: Connection reset Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at java.net.SocketOutputStream.write(SocketOutputStream.java:155) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at java.io.DataOutputStream.write(DataOutputStream.java:107) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at java.io.FilterOutputStream.write(FilterOutputStream.java:97) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65) Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751) The same thing happens to zookeeper-3, at which point there is no quorum so no new leader is elected. I apologize if anything I posted is wrong and I'm not intimately familiar with Zookeeper config. Our java process args look like java -Dzookeeper.log.dir=/var/log/zookeeper -Dzookeeper.log.file=zookeeper-zookeeper-server-zookeeper-1-automation99999.log -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError=kill -9 %p -cp /opt/zookeeper-3.8.4/bin/../zookeeper-server/target/classes:/opt/zookeeper-3.8.4/bin/../build/classes:/opt/zookeeper-3.8.4/bin/../zookeeper-server/target/lib/*.jar:/opt/zookeeper-3.8.4/bin/../build/lib/*.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-prometheus-metrics-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-jute-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/snappy-java-1.1.10.5.jar:/opt/zookeeper-3.8.4/bin/../lib/slf4j-api-1.7.30.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_servlet-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_hotspot-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_common-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-unix-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-classes-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-resolver-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-handler-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-codec-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-buffer-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/metrics-core-4.1.12.1.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-core-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-classic-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/jline-2.14.6.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-ajax-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-servlet-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-server-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-security-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-io-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-http-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/javax.servlet-api-3.1.0.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-databind-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-core-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-annotations-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-io-2.11.0.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-cli-1.5.0.jar:/opt/zookeeper-3.8.4/bin/../lib/audience-annotations-0.12.0.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-*.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-server/src/main/resources/lib/*.jar:/opt/zookeeper-3.8.4/conf::/opt/zookeeper-3.8.4/conf:/opt/zookeeper-3.8.4/*:/opt/zookeeper-3.8.4/lib/* -Xmx1000m -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory -Dzookeeper.ssl.keyStore.location=<XXXX> -Dzookeeper.ssl.keyStore.password=<XXXX> -Dzookeeper.ssl.trustStore.location=<XXXX> -Dzookeeper.ssl.trustStore.password=<XXXX> -Xms12g -Xmx12g -Xmn8g -Dfsync.warningthresholdms=200 -Dzookeeper.log.threshold=INFO -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:CMSIncrementalDutyCycle=100 -XX:-CMSIncrementalPacing -XX:PermSize=160m -XX:MaxPermSize=160m -XX:SurvivorRatio=4 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /opt/zookeeper-3.8.4/conf/zoo.cfg Our zoo.cfg (not including server config looks like: tickTime=2000 peerType=participant dataDir=/var/lib/zookeeper dataLogDir=/zookeeper/data clientPort=2181 secureClientPort=3181 maxClientCnxns=20000 initLimit=5 syncLimit=2
