Re: entire cluster dies with EOFException

Benjamin Reed Sat, 05 Jul 2014 23:51:07 -0700

any chance you are running out of disk space?
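
One quick way to check is to look at the free space on the volume that holds the ZooKeeper data directory. Below is a minimal sketch in Python, assuming the data directory is /var/lib/zookeeper (the path that shows up in the logs quoted below). The function name and the 1 GiB threshold are illustrative only, not anything ZooKeeper defines.

```python
import shutil


def free_space_report(path, min_free_bytes=1 << 30):
    """Summarize disk usage for the filesystem that contains `path`.

    `min_free_bytes` is a hypothetical alert threshold (1 GiB by default),
    chosen for illustration; tune it to your snapshot and txn log sizes.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "path": path,
        "total": usage.total,
        "free": usage.free,
        "low": usage.free < min_free_bytes,  # True when below the threshold
    }
```

Calling free_space_report("/var/lib/zookeeper") on each node around the time of a crash would show whether the volume was close to full when the transaction log was being written.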


On Sun, Jul 6, 2014 at 6:49 AM, Benjamin Reed <[email protected]> wrote:

>
>
>
> On Fri, Jul 4, 2014 at 10:35 PM, Aaron Zimmerman <[email protected]> wrote:
>
>> Thanks for getting back to me.
>>
>> Jordan,
>>
>> I don't think we are doing any large nodes or thousands of children.  We
>> are using zookeeper for storm and service discovery, so things are pretty
>> modest.
>>
>> Camille,
>>
>> I've created https://issues.apache.org/jira/browse/ZOOKEEPER-1955, and
>> attached the snapshot causing the EOF exception (I think..?), let me know
>> if you can discover anything from the snapshot.
>>
>> Thanks,
>>
>> Aaron Zimmerman
>>
>>
>>
>> On Fri, Jul 4, 2014 at 4:02 PM, Jordan Zimmerman <[email protected]> wrote:
>>
>> > I’ve seen EOF errors when the 1MB limit has been reached. Check to see if
>> > any ZNodes have thousands of children and/or big payloads.
>> >
>> > -JZ
>> >
>> >
>> > From: Aaron Zimmerman [email protected]
>> > Reply: [email protected] [email protected]
>> > Date: July 4, 2014 at 8:30:09 AM
>> > To: [email protected] [email protected]
>> > Subject:  entire cluster dies with EOFException
>> >
>> > Hi all,
>> >
>> > We have a 5 node zookeeper cluster that has been operating normally for
>> > several months. Starting a few days ago, the entire cluster crashes a few
>> > times per day, all nodes at the exact same time. We can't track down the
>> > exact issue, but deleting the snapshots and logs and restarting resolves.
>> >
>> > We are running exhibitor to monitor the cluster.
>> >
>> > It appears that something bad gets into the logs, causing an EOFException
>> > and this cascades through the entire cluster:
>> >
>> > 2014-07-04 12:55:26,328 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
>> > java.io.EOFException
>> > at java.io.DataInputStream.readInt(DataInputStream.java:375)
>> > at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>> > at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>> > at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>> > at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
>> > at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>> > at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
>> > 2014-07-04 12:55:26,328 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>> > java.lang.Exception: shutdown Follower
>> > at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>> > at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
>> >
>> >
>> > Then the server dies, exhibitor tries to restart each node, and they all
>> > get stuck trying to replay the bad transaction, logging things like:
>> >
>> >
>> > 2014-07-04 12:58:52,734 [myid:1] - INFO [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
>> > 2014-07-04 12:58:52,896 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300000021
>> > 2014-07-04 12:58:52,915 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300000021
>> > 2014-07-04 12:59:25,870 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300000021
>> > 2014-07-04 12:59:25,871 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300011fc2
>> > 2014-07-04 12:59:25,872 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300011fc2
>> > 2014-07-04 12:59:48,722 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300011fc2
>> >
>> > And the cluster is dead. The only way we have found to recover is to
>> > delete all of the data and restart.
>> >
>> > Anyone seen this before? Any ideas how I can track down what is causing
>> > the EOFException, or insulate zookeeper from completely crashing?
>> >
>> > Thanks,
>> >
>> > Aaron Zimmerman
>> >
>> >
>>
>
>
