After running a few more tests, it also happens with 75K simultaneously
expiring ephemeral nodes.

I guess a more fundamental question is whether Zookeeper is meant to handle
something like this or is this huge simultaneous expiration something
unsupported.

Thank you!

On Wed, Jun 12, 2024 at 10:24 AM Ken Mamitsuka <[email protected]> wrote:

> We are running a 5 node Zookeeper cluster on version 3.8.4 (it had the
> same issue with 3.7.1). We run a load test (just a few connections) that
> generates about 300,000 ephemeral nodes. We drop these connections
> simultaneously so those ephemeral nodes expire around the same time. At
> this point we see a crash in LearnerHandler:
>
> Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:12,764 [myid:] - INFO  [Snapshot 
> Thread:o.a.z.s.p.FileTxnSnapLog@479] - Snapshotting: 0x100093f83 to 
> /var/lib/zookeeper/version-2/
> snapshot.100093f83
> Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:12,765 [myid:] - INFO  
> [SyncThread:5:o.a.z.s.p.FileTxnLog@285] - Creating new log file: log.100093f85
> Jun 10 17:39:13 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:13,520 [myid:] - INFO  [Snapshot 
> Thread:o.a.z.s.ZooKeeperServer@558] - Snapshot taken in 756 ms
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:16,903 [myid:] - ERROR 
> [LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@719] - Unexpected 
> exception in LearnerHandler:
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> java.io.EOFException: null
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> java.io.DataInputStream.readInt(DataInputStream.java:392)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:16,903 [myid:] - ERROR 
> [Sender-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@372] - Exception while 
> sending packets in LearnerHandler
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> java.net.SocketException: Connection reset
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> java.io.DataOutputStream.write(DataOutputStream.java:107)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> java.io.FilterOutputStream.write(FilterOutputStream.java:97)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65)
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751)
>
> Then connections to learners drop:
>
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:16,903 [myid:] - INFO  
> [LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@1159] - 
> Synchronously closing socket to learner 2.
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:16,903 [myid:] - INFO  
> [LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@1159] - 
> Synchronously closing socket to learner 3.
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:16,903 [myid:] - INFO  
> [LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@1159] - 
> Synchronously closing socket to learner 4.
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:16,903 [myid:] - WARN  
> [LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@735] - ******* 
> GOODBYE /10.1.67.28:51888 ********
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:16,904 [myid:] - WARN  
> [LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@735] - ******* 
> GOODBYE /10.1.71.8:40626 ********
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:16,904 [myid:] - WARN  
> [LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@735] - ******* 
> GOODBYE /10.1.79.102:35670 ********
> Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:16,904 [myid:] - WARN  
> [LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@735] - ******* 
> GOODBYE /10.1.72.1:54384 ********
>
> Then it seems like it can't load (or update) the database?
>
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: 
> 2024-06-10 17:39:18,146 [myid:] - ERROR 
> [QuorumPeer[myid=5](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.ZooKeeperServer@892]
>  - Error updating DB
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: 
> java.io.IOException: Unreasonable length = 8010890
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:750)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:632)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:448)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:435)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:330)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.ZKDatabase.fastForwardDataBase(ZKDatabase.java:300)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:890)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:838)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.shutdown(LeaderZooKeeperServer.java:162)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:837)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:803)
> Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
>
> After this, zookeeper-4 appears to be elected leader and then hits the
> same LearnerHandler crash:
>
> Jun 10 17:39:17 zookeeper-4-automation99999 zookeeper-server[19461]: 
> 2024-06-10 17:39:17,208 [myid:] - INFO  
> [QuorumPeer[myid=4](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.q.QuorumPeer@1549]
>  - LEADING
>
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 
> 2024-06-10 17:39:24,125 [myid:] - ERROR 
> [LearnerHandler-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@719] - Unexpected 
> exception in LearnerHandler:
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 
> java.io.EOFException: null
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> java.io.DataInputStream.readInt(DataInputStream.java:392)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 
> 2024-06-10 17:39:24,125 [myid:] - ERROR 
> [Sender-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@372] - Exception while 
> sending packets in LearnerHandler
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 
> java.net.SocketException: Connection reset
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> java.io.DataOutputStream.write(DataOutputStream.java:107)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> java.io.FilterOutputStream.write(FilterOutputStream.java:97)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65)
> Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751)
>
> The same thing happens to zookeeper-3, at which point there is no quorum
> so no new leader is elected. I apologize if anything I posted is wrong and
> I'm not intimately familiar with Zookeeper config. Our java process args
> look like
>
> java -Dzookeeper.log.dir=/var/log/zookeeper
> -Dzookeeper.log.file=zookeeper-zookeeper-server-zookeeper-1-automation99999.log
> -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError=kill -9 %p -cp
> /opt/zookeeper-3.8.4/bin/../zookeeper-server/target/classes:/opt/zookeeper-3.8.4/bin/../build/classes:/opt/zookeeper-3.8.4/bin/../zookeeper-server/target/lib/*.jar:/opt/zookeeper-3.8.4/bin/../build/lib/*.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-prometheus-metrics-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-jute-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/snappy-java-1.1.10.5.jar:/opt/zookeeper-3.8.4/bin/../lib/slf4j-api-1.7.30.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_servlet-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_hotspot-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_common-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-unix-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-classes-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-resolver-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-handler-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-codec-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-buffer-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/metrics-core-4.1.12.1.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-core-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-classic-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/jline-2.14.6.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-ajax-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-servlet-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-server-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-security-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-io-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-http-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/javax.servlet-api-3.1.0.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-databind-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-core-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-annotations-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-io-2.11.0.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-cli-1.5.0.jar:/opt/zookeeper-3.8.4/bin/../lib/audience-annotations-0.12.0.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-*.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-server/src/main/resources/lib/*.jar:/opt/zookeeper-3.8.4/conf::/opt/zookeeper-3.8.4/conf:/opt/zookeeper-3.8.4/*:/opt/zookeeper-3.8.4/lib/*
> -Xmx1000m
> -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> -Dzookeeper.ssl.keyStore.location=<XXXX>
> -Dzookeeper.ssl.keyStore.password=<XXXX>
> -Dzookeeper.ssl.trustStore.location=<XXXX>
> -Dzookeeper.ssl.trustStore.password=<XXXX> -Xms12g -Xmx12g -Xmn8g
> -Dfsync.warningthresholdms=200 -Dzookeeper.log.threshold=INFO
> -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC
> -XX:+CMSIncrementalMode -XX:CMSIncrementalDutyCycle=100
> -XX:-CMSIncrementalPacing -XX:PermSize=160m -XX:MaxPermSize=160m
> -XX:SurvivorRatio=4 -Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.local.only=false
> org.apache.zookeeper.server.quorum.QuorumPeerMain
> /opt/zookeeper-3.8.4/conf/zoo.cfg
>
> Our zoo.cfg (not including server config looks like:
>
> tickTime=2000
> peerType=participant
> dataDir=/var/lib/zookeeper
> dataLogDir=/zookeeper/data
> clientPort=2181
> secureClientPort=3181
> maxClientCnxns=20000
> initLimit=5
> syncLimit=2
>


-- 
Ken Mamitsuka
Staff Software Engineer, Cantina
[email protected] | cantina.com

Reply via email to