any chance you are running out of disk space?
On Sun, Jul 6, 2014 at 6:49 AM, Benjamin Reed <[email protected]> wrote: > > > > On Fri, Jul 4, 2014 at 10:35 PM, Aaron Zimmerman < > [email protected]> wrote: > >> Thanks for getting back to me. >> >> Jordan, >> >> I don't think we are doing any large nodes or thousands of children. We >> are using zookeeper for storm and service discovery, so things are pretty >> modest. >> >> Camille, >> >> I've created https://issues.apache.org/jira/browse/ZOOKEEPER-1955, and >> attached the snapshot causing the EOF exception (I think..?), let me know >> if you can discover anything from the snapshot. >> >> Thanks, >> >> Aaron Zimmerman >> >> >> >> On Fri, Jul 4, 2014 at 4:02 PM, Jordan Zimmerman < >> [email protected] >> > wrote: >> >> > I’ve seen EOF errors when the 1MB limit has been reached. Check to see >> if >> > any ZNodes have thousands of children and/or big payloads. >> > >> > -JZ >> > >> > >> > From: Aaron Zimmerman [email protected] >> > Reply: [email protected] [email protected] >> > Date: July 4, 2014 at 8:30:09 AM >> > To: [email protected] [email protected] >> > Subject: entire cluster dies with EOFException >> > >> > Hi all, >> > >> > We have a 5 node zookeeper cluster that has been operating normally for >> > several months. Starting a few days ago, the entire cluster crashes a >> few >> > times per day, all nodes at the exact same time. We can't track down the >> > exact issue, but deleting the snapshots and logs and restarting >> resolves. >> > >> > We are running exhibitor to monitor the cluster. >> > >> > It appears that something bad gets into the logs, causing an >> EOFException >> > and this cascades through the entire cluster: >> > >> > 2014-07-04 12:55:26,328 [myid:1] - WARN >> > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when >> > following the leader >> > java.io.EOFException >> > at java.io.DataInputStream.readInt(DataInputStream.java:375) >> > at >> > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) >> > at >> > >> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) >> > >> > at >> > >> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) >> > at >> > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152) >> > at >> > >> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85) >> > at >> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740) >> > 2014-07-04 12:55:26,328 [myid:1] - INFO >> > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown >> called >> > java.lang.Exception: shutdown Follower >> > at >> > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) >> > at >> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744) >> > >> > >> > Then the server dies, exhibitor tries to restart each node, and they all >> > get stuck trying to replay the bad transaction, logging things like: >> > >> > >> > 2014-07-04 12:58:52,734 [myid:1] - INFO [main:FileSnap@83] - Reading >> > snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0 >> > 2014-07-04 12:58:52,896 [myid:1] - DEBUG >> > [main:FileTxnLog$FileTxnIterator@575] - Created new input stream >> > /var/lib/zookeeper/version-2/log.300000021 >> > 2014-07-04 12:58:52,915 [myid:1] - DEBUG >> > [main:FileTxnLog$FileTxnIterator@578] - Created new input archive >> > /var/lib/zookeeper/version-2/log.300000021 >> > 2014-07-04 12:59:25,870 [myid:1] - DEBUG >> > [main:FileTxnLog$FileTxnIterator@618] - EOF excepton >> > java.io.EOFException: >> > Failed to read /var/lib/zookeeper/version-2/log.300000021 >> > 2014-07-04 12:59:25,871 [myid:1] - DEBUG >> > [main:FileTxnLog$FileTxnIterator@575] - Created new input stream >> > /var/lib/zookeeper/version-2/log.300011fc2 >> > 2014-07-04 12:59:25,872 [myid:1] - DEBUG >> > [main:FileTxnLog$FileTxnIterator@578] - Created new input archive >> > /var/lib/zookeeper/version-2/log.300011fc2 >> > 2014-07-04 12:59:48,722 [myid:1] - DEBUG >> > [main:FileTxnLog$FileTxnIterator@618] - EOF excepton >> > java.io.EOFException: >> > Failed to read /var/lib/zookeeper/version-2/log.300011fc2 >> > >> > And the cluster is dead. The only way we have found to recover is to >> > delete all of the data and restart. >> > >> > Anyone seen this before? Any ideas how I can track down what is causing >> > the EOFException, or insulate zookeeper from completely crashing? >> > >> > Thanks, >> > >> > Aaron Zimmerman >> > >> > >> > >
