Hi Damien,

I dug out the stack trace from the last time it happened. Hope it helps. The txn log entries from the client requests should be very small, as the data size is only 4 bytes in our write load test. The "overflowed" closeSessionTxn seems to be the issue.
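To make the numbers concrete: the trace below reports a single record length of 3175014 bytes, while jute.maxbuffer was left at its default (about 1M, per the original report quoted further down). The following is only a minimal sketch of that kind of length check, not the actual BinaryInputArchive code. The class name, the helper, and the exact limit value (0xfffff) are assumptions for illustration, and the real check may allow some additional headroom.

```java
import java.io.IOException;

public class LengthCheckSketch {
    // Assumed limit for illustration only: roughly 1 MB (0xfffff = 1048575 bytes).
    static final int ASSUMED_MAX_BUFFER = 0xfffff;

    // Hypothetical helper: rejects a record whose declared length is
    // negative or larger than the configured buffer limit.
    static void checkLength(int len, int maxBuffer) throws IOException {
        if (len < 0 || len > maxBuffer) {
            throw new IOException("Unreasonable length = " + len);
        }
    }

    public static void main(String[] args) {
        // 4 bytes: the payload size used in the write load test.
        // 3175014 bytes: the record length reported in the stack trace.
        int[] lengths = {4, 3175014};
        for (int len : lengths) {
            try {
                checkLength(len, ASSUMED_MAX_BUFFER);
                System.out.println(len + " bytes: accepted");
            } catch (IOException e) {
                System.out.println(len + " bytes: rejected (" + e.getMessage() + ")");
            }
        }
    }
}
```

Under these assumptions the 4-byte payload passes and the 3175014-byte record is rejected, which is consistent with a single oversized transaction, rather than the client writes, tripping the limit.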
2021-07-01 16:02:00,837 [myid:3] - ERROR [main:QuorumPeerMain@114] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1200)
    at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1131)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
    at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
Caused by: java.io.IOException: Unreasonable length = 3175014
    at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
    at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
    at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
    at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:749)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
    at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
    at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:287)
    at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1145)

Cheers,

Li

On Wed, Jul 28, 2021 at 6:55 PM Li Wang <li4w...@gmail.com> wrote:

> On Wed, Jul 28, 2021 at 5:20 AM Damien Diederen <ddiede...@apache.org> wrote:
>
>> Hi Li, all,
>>
>> > When load testing the write operation against Zookeeper 3.7.0, I observed a
>> > couple of times that the server crashed because the txn log was too large
>> > and it was not able to load it.
>>
>> Difficult to say without more details, but I suspect ZOOKEEPER-4306 to
>> be the culprit:
>>
>>     https://issues.apache.org/jira/browse/ZOOKEEPER-4306
>
> Yes, ZOOKEEPER-4306 could be the culprit. In my write operation test, all
> the nodes were created as ephemeral.
>
>> Would it be possible for you to share the transaction log ZooKeeper
>> fails to load?
>
> I observed the issue a couple of times about one month ago. I tried to
> investigate it more recently, but was not able to reproduce it.
> I remember reading the txn log file using zkTxnLogToolkit.sh, and it was
> similar to what's described in
> https://issues.apache.org/jira/browse/ZOOKEEPER-4306.
> Unfortunately, I didn't save the txn log.
>
> I will save the txn log if I can reproduce it or it happens again.
>
> Thanks,
>
> Li
>
>> HTH, -D
>>
>> --8<---------------original message------------->8---
>>
>> Li Wang <li4w...@gmail.com> writes:
>> > Hi,
>> >
>> > When load testing the write operation against Zookeeper 3.7.0, I observed a
>> > couple of times that the server crashed because the txn log was too large
>> > and it was not able to load it. However the data size of write is only 4
>> > bytes in the load test and the *jute.maxbuffer* was set to default (i.e.
>> > 1M). The error doesn't always happen.
>> >
>> > I wonder if anyone has also seen this error or has any idea on what may
>> > cause the issue?
>> >
>> > StackTrace
>> > =========
>> >
>> > 2021-07-01 16:02:00,837 [myid:3] - ERROR [main:QuorumPeerMain@114] - Unexpected exception, exiting abnormally
>> > java.lang.RuntimeException: Unable to run quorum server
>> >     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1200)
>> >     at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1131)
>> >     at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
>> >     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
>> >     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
>> > Caused by: java.io.IOException: Unreasonable length = 3175014
>> >     at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
>> >     at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
>> >     at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
>> >     at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:749)
>> >     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:361)
>> >     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.lambda$restore$0(FileTxnSnapLog.java:267)
>> >     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:312)
>> >     at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:287)
>> >     at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1145)
>> >
>> > Thanks,
>> >
>> > Li