[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271543#comment-15271543 ]
Flavio Junqueira commented on ZOOKEEPER-2104: --------------------------------------------- If increasing the initLimit value fixes it, then it is probably the case that you have large snapshots. Could you check the size of your snapshot files? > Sudden crash of all nodes in the cluster > ---------------------------------------- > > Key: ZOOKEEPER-2104 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.4.6 > Reporter: Benjamin Jaton > Attachments: zookeeper-errors.txt, zookeeper-warns.txt > > > In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying > "ZooKeeper is not running" messages. > Not retry seems to be happening after that. > This a request to understand what happened and probably to improve the logs > when it does. > See logs below: > NODE1: > -- no log for several days before this -- > 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - > fsync-ing the write ahead log in SyncThread:1 took 11024ms which will > adversely effect operation latency. See the ZooKeeper troubleshooting guide > 2015-01-04 16:18:22,380 [myid:1] - WARN > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when > following the leader > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > at > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153) > at > org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85) > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786) > 2015-01-04 16:18:23,384 [myid:1] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception > causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not > running > 2015-01-04 16:18:23,492 [myid:1] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception > causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not > running > 2015-01-04 16:18:24,060 [myid:1] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception > causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not > running > NODE2: > -- no log for several days before this -- > 2015-01-04 16:18:21,899 [myid:3] - WARN > [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when > following the leader > java.io.EOFException > at java.io.DataInputStream.readInt(DataInputStream.java:392) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) > at > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153) > at > org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85) > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786) > 2015-01-04 16:18:22,760 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception > causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not > running > 2015-01-04 16:18:22,801 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception > causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not > running > 2015-01-04 16:18:22,886 [myid:3] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception > causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not > running > NODE3 (leader): > -- no log for several days before this -- > 2015-01-04 16:18:21,897 [myid:2] - WARN > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing > connection to peer due to transaction timeout. > 2015-01-04 16:18:21,898 [myid:2] - WARN > [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE > /204.53.107.249:43402 ******** > 2015-01-04 16:18:21,905 [myid:2] - WARN > [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing > connection to peer due to transaction timeout. > 2015-01-04 16:18:21,907 [myid:2] - WARN > [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE > /204.53.107.247:45953 ******** > 2015-01-04 16:18:21,918 [myid:2] - WARN > [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring > unexpected exception > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219) > at > java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340) > at > java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338) > at > org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656) > at > org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649) > 2015-01-04 16:18:23,003 [myid:2] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception > causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not > running > 2015-01-04 16:18:23,007 [myid:2] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception > causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not > running > 2015-01-04 16:18:23,115 [myid:2] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception > causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not > running -- This message was sent by Atlassian JIRA (v6.3.4#6332)