After running a few more tests, it also happens with 75K simultaneously expiring ephemeral nodes.
I guess a more fundamental question is whether Zookeeper is meant to handle something like this or is this huge simultaneous expiration something unsupported. Thank you! On Wed, Jun 12, 2024 at 10:24 AM Ken Mamitsuka <[email protected]> wrote: > We are running a 5 node Zookeeper cluster on version 3.8.4 (it had the > same issue with 3.7.1). We run a load test (just a few connections) that > generates about 300,000 ephemeral nodes. We drop these connections > simultaneously so those ephemeral nodes expire around the same time. At > this point we see a crash in LearnerHandler: > > Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:12,764 [myid:] - INFO [Snapshot > Thread:o.a.z.s.p.FileTxnSnapLog@479] - Snapshotting: 0x100093f83 to > /var/lib/zookeeper/version-2/ > snapshot.100093f83 > Jun 10 17:39:12 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:12,765 [myid:] - INFO > [SyncThread:5:o.a.z.s.p.FileTxnLog@285] - Creating new log file: log.100093f85 > Jun 10 17:39:13 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:13,520 [myid:] - INFO [Snapshot > Thread:o.a.z.s.ZooKeeperServer@558] - Snapshot taken in 756 ms > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:16,903 [myid:] - ERROR > [LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@719] - Unexpected > exception in LearnerHandler: > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > java.io.EOFException: null > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > java.io.DataInputStream.readInt(DataInputStream.java:392) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:16,903 [myid:] - ERROR > [Sender-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@372] - Exception while > sending packets in LearnerHandler > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > java.net.SocketException: Connection reset > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > java.net.SocketOutputStream.write(SocketOutputStream.java:155) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > java.io.DataOutputStream.write(DataOutputStream.java:107) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > java.io.FilterOutputStream.write(FilterOutputStream.java:97) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65) > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751) > > Then connections to learners drop: > > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:16,903 [myid:] - INFO > [LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@1159] - > Synchronously closing socket to learner 2. > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:16,903 [myid:] - INFO > [LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@1159] - > Synchronously closing socket to learner 3. > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:16,903 [myid:] - INFO > [LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@1159] - > Synchronously closing socket to learner 4. > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:16,903 [myid:] - WARN > [LearnerHandler-/10.1.67.28:51888:o.a.z.s.q.LearnerHandler@735] - ******* > GOODBYE /10.1.67.28:51888 ******** > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:16,904 [myid:] - WARN > [LearnerHandler-/10.1.71.8:40626:o.a.z.s.q.LearnerHandler@735] - ******* > GOODBYE /10.1.71.8:40626 ******** > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:16,904 [myid:] - WARN > [LearnerHandler-/10.1.79.102:35670:o.a.z.s.q.LearnerHandler@735] - ******* > GOODBYE /10.1.79.102:35670 ******** > Jun 10 17:39:16 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:16,904 [myid:] - WARN > [LearnerHandler-/10.1.72.1:54384:o.a.z.s.q.LearnerHandler@735] - ******* > GOODBYE /10.1.72.1:54384 ******** > > Then it seems like it can't load (or update) the database? > > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: > 2024-06-10 17:39:18,146 [myid:] - ERROR > [QuorumPeer[myid=5](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.ZooKeeperServer@892] > - Error updating DB > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: > java.io.IOException: Unreasonable length = 8010890 > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:750) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:632) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:448) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:435) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:330) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.ZKDatabase.fastForwardDataBase(ZKDatabase.java:300) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:890) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:838) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.shutdown(LeaderZooKeeperServer.java:162) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:837) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:803) > Jun 10 17:39:18 zookeeper-5-automation99999 zookeeper-server[19472]: at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1552) > > After this, zookeeper-4 appears to be elected leader and then hits the > same LearnerHandler crash: > > Jun 10 17:39:17 zookeeper-4-automation99999 zookeeper-server[19461]: > 2024-06-10 17:39:17,208 [myid:] - INFO > [QuorumPeer[myid=4](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:3181):o.a.z.s.q.QuorumPeer@1549] > - LEADING > > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: > 2024-06-10 17:39:24,125 [myid:] - ERROR > [LearnerHandler-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@719] - Unexpected > exception in LearnerHandler: > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: > java.io.EOFException: null > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > java.io.DataInputStream.readInt(DataInputStream.java:392) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:656) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: > 2024-06-10 17:39:24,125 [myid:] - ERROR > [Sender-/10.1.67.28:35566:o.a.z.s.q.LearnerHandler@372] - Exception while > sending packets in LearnerHandler > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: > java.net.SocketException: Connection reset > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:115) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > java.net.SocketOutputStream.write(SocketOutputStream.java:155) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > java.io.BufferedOutputStream.write(BufferedOutputStream.java:122) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > java.io.DataOutputStream.write(DataOutputStream.java:107) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > java.io.FilterOutputStream.write(FilterOutputStream.java:97) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:122) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.zookeeper.server.quorum.QuorumPacket.serialize(QuorumPacket.java:71) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:126) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:368) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.zookeeper.server.quorum.LearnerHandler.access$200(LearnerHandler.java:65) > Jun 10 17:39:24 zookeeper-4-automation99999 zookeeper-server[19461]: at > org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:751) > > The same thing happens to zookeeper-3, at which point there is no quorum > so no new leader is elected. I apologize if anything I posted is wrong and > I'm not intimately familiar with Zookeeper config. Our java process args > look like > > java -Dzookeeper.log.dir=/var/log/zookeeper > -Dzookeeper.log.file=zookeeper-zookeeper-server-zookeeper-1-automation99999.log > -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError=kill -9 %p -cp > /opt/zookeeper-3.8.4/bin/../zookeeper-server/target/classes:/opt/zookeeper-3.8.4/bin/../build/classes:/opt/zookeeper-3.8.4/bin/../zookeeper-server/target/lib/*.jar:/opt/zookeeper-3.8.4/bin/../build/lib/*.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-prometheus-metrics-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-jute-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/zookeeper-3.8.4.jar:/opt/zookeeper-3.8.4/bin/../lib/snappy-java-1.1.10.5.jar:/opt/zookeeper-3.8.4/bin/../lib/slf4j-api-1.7.30.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_servlet-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_hotspot-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient_common-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/simpleclient-0.9.0.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-unix-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-native-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-classes-epoll-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-transport-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-resolver-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-handler-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-common-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-codec-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/netty-buffer-4.1.105.Final.jar:/opt/zookeeper-3.8.4/bin/../lib/metrics-core-4.1.12.1.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-core-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/logback-classic-1.2.13.jar:/opt/zookeeper-3.8.4/bin/../lib/jline-2.14.6.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-ajax-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-util-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-servlet-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-server-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-security-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-io-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/jetty-http-9.4.53.v20231009.jar:/opt/zookeeper-3.8.4/bin/../lib/javax.servlet-api-3.1.0.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-databind-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-core-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/jackson-annotations-2.15.2.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-io-2.11.0.jar:/opt/zookeeper-3.8.4/bin/../lib/commons-cli-1.5.0.jar:/opt/zookeeper-3.8.4/bin/../lib/audience-annotations-0.12.0.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-*.jar:/opt/zookeeper-3.8.4/bin/../zookeeper-server/src/main/resources/lib/*.jar:/opt/zookeeper-3.8.4/conf::/opt/zookeeper-3.8.4/conf:/opt/zookeeper-3.8.4/*:/opt/zookeeper-3.8.4/lib/* > -Xmx1000m > -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory > -Dzookeeper.ssl.keyStore.location=<XXXX> > -Dzookeeper.ssl.keyStore.password=<XXXX> > -Dzookeeper.ssl.trustStore.location=<XXXX> > -Dzookeeper.ssl.trustStore.password=<XXXX> -Xms12g -Xmx12g -Xmn8g > -Dfsync.warningthresholdms=200 -Dzookeeper.log.threshold=INFO > -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC > -XX:+CMSIncrementalMode -XX:CMSIncrementalDutyCycle=100 > -XX:-CMSIncrementalPacing -XX:PermSize=160m -XX:MaxPermSize=160m > -XX:SurvivorRatio=4 -Dcom.sun.management.jmxremote > -Dcom.sun.management.jmxremote.local.only=false > org.apache.zookeeper.server.quorum.QuorumPeerMain > /opt/zookeeper-3.8.4/conf/zoo.cfg > > Our zoo.cfg (not including server config looks like: > > tickTime=2000 > peerType=participant > dataDir=/var/lib/zookeeper > dataLogDir=/zookeeper/data > clientPort=2181 > secureClientPort=3181 > maxClientCnxns=20000 > initLimit=5 > syncLimit=2 > -- Ken Mamitsuka Staff Software Engineer, Cantina [email protected] | cantina.com
