We are running a 5 node Zookeeper cluster on version 3.8.4 (it had the same
issue with 3.7.1). We run a load test (just a few connections) that
generates about 300,000 ephemeral nodes. We drop these connections
simultaneously so those ephemeral nodes expire around the same time. At
this point we see a crash in LearnerHandler:

Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:12,764 [myid:] - INFO  [Snapshot
Thread:o.a.z.s.p.FileTxnSnapLog@479] - Snapshotting: 0x100093f83 to
/var/lib/zookeeper/version-2/
snapshot.100093f83
Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:12,765 [myid:] - INFO
[SyncThread:5:o.a.z.s.p.FileTxnLog@285] - Creating new log file:
log.100093f85
Jun 10 17:39:13 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:13,520 [myid:] - INFO  [Snapshot
Thread:o.a.z.s.ZooKeeperServer@558] - Snapshot taken in 756 ms
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:16,903 [myid:] - ERROR
[LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@719] -
Unexpected exception in LearnerHandler:
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
java.io.EOFException: null
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at java.io.DataInputStream.readInt(DataInputStream.java:392)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:16,903 [myid:] - ERROR
[Sender-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@372] - Exception
while sending packets in LearnerHandler
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
java.net.SocketException: Connection reset
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at java.io.DataOutputStream.write(DataOutputStream.java:107)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65)
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751)

Then connections to learners drop:

Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:16,903 [myid:] - INFO
[LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@1159] -
Synchronously closing socket to learner 2.
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:16,903 [myid:] - INFO
[LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@1159] -
Synchronously closing socket to learner 3.
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:16,903 [myid:] - INFO
[LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@1159] -
Synchronously closing socket to learner 4.
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:16,903 [myid:] - WARN
[LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@735] -
******* GOODBYE /10.1.67.28:51888 ********
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:16,904 [myid:] - WARN
[LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@735] -
******* GOODBYE /10.1.71.8:40626 ********
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:16,904 [myid:] - WARN
[LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@735] -
******* GOODBYE /10.1.79.102:35670 ********
Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:16,904 [myid:] - WARN
[LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@735] -
******* GOODBYE /10.1.72.1:54384 ********

Then it seems like it can't load (or update) the database?

Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
2024-06-10 17:39:18,146 [myid:] - ERROR
[QuorumPeer[myid=5](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.ZooKeeperServer@892]
- Error updating DB
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
java.io.IOException: Unreasonable length = 8010890
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:750)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:632)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:448)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:435)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:330)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.ZKDatabase.fastForwardDataBase(ZKDatabase.java:300)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:890)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:838)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at 
org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.shutdown(LeaderZooKeeperServer.java:162)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:837)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:803)
Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]:
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)

After this, zookeeper-4 appears to be elected leader and then hits the same
LearnerHandler crash:

Jun 10 17:39:17 zookeeper-4-automation99999 zookeeper-server[19461]:
2024-06-10 17:39:17,208 [myid:] - INFO
[QuorumPeer[myid=4](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.q.QuorumPeer@1549]
 - LEADING

Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
2024-06-10 17:39:24,125 [myid:] - ERROR
[LearnerHandler-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@719] -
Unexpected exception in LearnerHandler:
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
java.io.EOFException: null
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at java.io.DataInputStream.readInt(DataInputStream.java:392)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at 
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
2024-06-10 17:39:24,125 [myid:] - ERROR
[Sender-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@372] - Exception
while sending packets in LearnerHandler
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
java.net.SocketException: Connection reset
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at java.io.DataOutputStream.write(DataOutputStream.java:107)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at 
org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at 
org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at 
org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65)
Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]:
at 
org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751)

The same thing happens to zookeeper-3, at which point there is no quorum so
no new leader is elected. I apologize if anything I posted is wrong and I'm
not intimately familiar with Zookeeper config. Our java process args look
like

java -Dzookeeper.log.dir=/var/log/zookeeper
-Dzookeeper.log.file=zookeeper-zookeeper-server-zookeeper-1-automation99999.log
-XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError=kill -9 %p -cp
/opt/zookeeper-3.8.4/bin/../zookeeper-server/target/classes:/opt/zookeeper-3.8.4/bin/../build/classes:/opt/zookeeper-3.8.4/bin/../zookeeper-server/target/lib/*.jar:/opt/zookeeper-3.8.4/bin/../build/lib/*.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-prometheus-metrics-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-jute-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/snappy-java-1.1.10.5.jar:/opt/zookeeper-3.8.4/bin/../lib/slf4j-api-1.7.30.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_servlet-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_hotspot-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_common-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-unix-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-classes-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-resolver-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-handler-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-codec-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-buffer-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/metrics-core-4.1.12.1.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-core-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-classic-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/jline-2.14.6.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-ajax-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-servlet-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-server-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-security-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-io-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-http-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/javax.servlet-api-3.1.0.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-databind-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-core-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-annotations-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-io-2.11.0.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-cli-1.5.0.jar:/opt/zookeeper-3.8.4/bin/../lib/audience-annotations-0.12.0.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-*.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-server/src/main/resources/lib/*.jar:/opt/zookeeper-3.8.4/conf::/opt/zookeeper-3.8.4/conf:/opt/zookeeper-3.8.4/*:/opt/zookeeper-3.8.4/lib/*
-Xmx1000m
-Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
-Dzookeeper.ssl.keyStore.location=<XXXX>
-Dzookeeper.ssl.keyStore.password=<XXXX>
-Dzookeeper.ssl.trustStore.location=<XXXX>
-Dzookeeper.ssl.trustStore.password=<XXXX> -Xms12g -Xmx12g -Xmn8g
-Dfsync.warningthresholdms=200 -Dzookeeper.log.threshold=INFO
-XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode -XX:CMSIncrementalDutyCycle=100
-XX:-CMSIncrementalPacing -XX:PermSize=160m -XX:MaxPermSize=160m
-XX:SurvivorRatio=4 -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.local.only=false
org.apache.zookeeper.server.quorum.QuorumPeerMain
/opt/zookeeper-3.8.4/conf/zoo.cfg

Our zoo.cfg (not including server config looks like:

tickTime=2000
peerType=participant
dataDir=/var/lib/zookeeper
dataLogDir=/zookeeper/data
clientPort=2181
secureClientPort=3181
maxClientCnxns=20000
initLimit=5
syncLimit=2

Reply via email to