I'm interested in the same... we're thinking of using ephemeral nodes
to indicate the presence of data that this node can be searched for.
The numbers will reach hundreds of thousands on the high end.

On Mon, Jun 17, 2024 at 9:08 PM Ken Mamitsuka <[email protected]> wrote:
>
> After running a few more tests, it also happens with 75K simultaneously
> expiring ephemeral nodes.
>
> I guess a more fundamental question is whether Zookeeper is meant to handle
> something like this or is this huge simultaneous expiration something
> unsupported.
>
> Thank you!
>
> On Wed, Jun 12, 2024 at 10:24 AM Ken Mamitsuka <[email protected]> wrote:
>
> > We are running a 5 node Zookeeper cluster on version 3.8.4 (it had the
> > same issue with 3.7.1). We run a load test (just a few connections) that
> > generates about 300,000 ephemeral nodes. We drop these connections
> > simultaneously so those ephemeral nodes expire around the same time. At
> > this point we see a crash in LearnerHandler:
> >
> > Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:12,764 [myid:] - INFO  [Snapshot 
> > Thread:o.a.z.s.p.FileTxnSnapLog@479] - Snapshotting: 0x100093f83 to 
> > /var/lib/zookeeper/version-2/
> > snapshot.100093f83
> > Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:12,765 [myid:] - INFO  
> > [SyncThread:5:o.a.z.s.p.FileTxnLog@285] - Creating new log file: 
> > log.100093f85
> > Jun 10 17:39:13 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:13,520 [myid:] - INFO  [Snapshot 
> > Thread:o.a.z.s.ZooKeeperServer@558] - Snapshot taken in 756 ms
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:16,903 [myid:] - ERROR 
> > [LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@719] - 
> > Unexpected exception in LearnerHandler:
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > java.io.EOFException: null
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > java.io.DataInputStream.readInt(DataInputStream.java:392)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:16,903 [myid:] - ERROR 
> > [Sender-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@372] - Exception while 
> > sending packets in LearnerHandler
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > java.net.SocketException: Connection reset
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > java.io.DataOutputStream.write(DataOutputStream.java:107)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > java.io.FilterOutputStream.write(FilterOutputStream.java:97)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65)
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751)
> >
> > Then connections to learners drop:
> >
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:16,903 [myid:] - INFO  
> > [LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@1159] - 
> > Synchronously closing socket to learner 2.
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:16,903 [myid:] - INFO  
> > [LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@1159] - 
> > Synchronously closing socket to learner 3.
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:16,903 [myid:] - INFO  
> > [LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@1159] - 
> > Synchronously closing socket to learner 4.
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:16,903 [myid:] - WARN  
> > [LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@735] - ******* 
> > GOODBYE /10.1.67.28:51888 ********
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:16,904 [myid:] - WARN  
> > [LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@735] - ******* 
> > GOODBYE /10.1.71.8:40626 ********
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:16,904 [myid:] - WARN  
> > [LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@735] - ******* 
> > GOODBYE /10.1.79.102:35670 ********
> > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:16,904 [myid:] - WARN  
> > [LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@735] - ******* 
> > GOODBYE /10.1.72.1:54384 ********
> >
> > Then it seems like it can't load (or update) the database?
> >
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > 2024-06-10 17:39:18,146 [myid:] - ERROR 
> > [QuorumPeer[myid=5](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.ZooKeeperServer@892]
> >  - Error updating DB
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: 
> > java.io.IOException: Unreasonable length = 8010890
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:750)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:632)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:448)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:435)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:330)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.ZKDatabase.fastForwardDataBase(ZKDatabase.java:300)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:890)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:838)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.shutdown(LeaderZooKeeperServer.java:162)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:837)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:803)
> > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at 
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552)
> >
> > After this, zookeeper-4 appears to be elected leader and then hits the
> > same LearnerHandler crash:
> >
> > Jun 10 17:39:17 zookeeper-4-automation99999 zookeeper-server[19461]: 
> > 2024-06-10 17:39:17,208 [myid:] - INFO  
> > [QuorumPeer[myid=4](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.q.QuorumPeer@1549]
> >  - LEADING
> >
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 
> > 2024-06-10 17:39:24,125 [myid:] - ERROR 
> > [LearnerHandler-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@719] - 
> > Unexpected exception in LearnerHandler:
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 
> > java.io.EOFException: null
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > java.io.DataInputStream.readInt(DataInputStream.java:392)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 
> > 2024-06-10 17:39:24,125 [myid:] - ERROR 
> > [Sender-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@372] - Exception while 
> > sending packets in LearnerHandler
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: 
> > java.net.SocketException: Connection reset
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > java.io.DataOutputStream.write(DataOutputStream.java:107)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > java.io.FilterOutputStream.write(FilterOutputStream.java:97)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65)
> > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at 
> > org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751)
> >
> > The same thing happens to zookeeper-3, at which point there is no quorum
> > so no new leader is elected. I apologize if anything I posted is wrong and
> > I'm not intimately familiar with Zookeeper config. Our java process args
> > look like
> >
> > java -Dzookeeper.log.dir=/var/log/zookeeper
> > -Dzookeeper.log.file=zookeeper-zookeeper-server-zookeeper-1-automation99999.log
> > -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError=kill -9 %p -cp
> > /opt/zookeeper-3.8.4/bin/../zookeeper-server/target/classes:/opt/zookeeper-3.8.4/bin/../build/classes:/opt/zookeeper-3.8.4/bin/../zookeeper-server/target/lib/*.jar:/opt/zookeeper-3.8.4/bin/../build/lib/*.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-prometheus-metrics-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-jute-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/snappy-java-1.1.10.5.jar:/opt/zookeeper-3.8.4/bin/../lib/slf4j-api-1.7.30.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_servlet-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_hotspot-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_common-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-unix-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-classes-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-resolver-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-handler-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-codec-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-buffer-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/metrics-core-4.1.12.1.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-core-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-classic-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/jline-2.14.6.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-ajax-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-servlet-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-server-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-security-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-io-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-http-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/javax.servlet-api-3.1.0.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-databind-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-core-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-annotations-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-io-2.11.0.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-cli-1.5.0.jar:/opt/zookeeper-3.8.4/bin/../lib/audience-annotations-0.12.0.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-*.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-server/src/main/resources/lib/*.jar:/opt/zookeeper-3.8.4/conf::/opt/zookeeper-3.8.4/conf:/opt/zookeeper-3.8.4/*:/opt/zookeeper-3.8.4/lib/*
> > -Xmx1000m
> > -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> > -Dzookeeper.ssl.keyStore.location=<XXXX>
> > -Dzookeeper.ssl.keyStore.password=<XXXX>
> > -Dzookeeper.ssl.trustStore.location=<XXXX>
> > -Dzookeeper.ssl.trustStore.password=<XXXX> -Xms12g -Xmx12g -Xmn8g
> > -Dfsync.warningthresholdms=200 -Dzookeeper.log.threshold=INFO
> > -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC
> > -XX:+CMSIncrementalMode -XX:CMSIncrementalDutyCycle=100
> > -XX:-CMSIncrementalPacing -XX:PermSize=160m -XX:MaxPermSize=160m
> > -XX:SurvivorRatio=4 -Dcom.sun.management.jmxremote
> > -Dcom.sun.management.jmxremote.local.only=false
> > org.apache.zookeeper.server.quorum.QuorumPeerMain
> > /opt/zookeeper-3.8.4/conf/zoo.cfg
> >
> > Our zoo.cfg (not including server config looks like:
> >
> > tickTime=2000
> > peerType=participant
> > dataDir=/var/lib/zookeeper
> > dataLogDir=/zookeeper/data
> > clientPort=2181
> > secureClientPort=3181
> > maxClientCnxns=20000
> > initLimit=5
> > syncLimit=2
> >
>
>
> --
> Ken Mamitsuka
> Staff Software Engineer, Cantina
> [email protected] | cantina.com

Reply via email to