[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058591#comment-15058591
 ] 

David Lao commented on ZOOKEEPER-2104:
--------------------------------------

In my case the logs and snapshots are on separate drives. I noticed the net 
traffics, with the resource monitor app which is part of the Windows OS, were 
from the leader to followers on port 2888. I was seeing ~15 MB/s traffic 
between the nodes. Will try to increasing the snapshot count (though I'd think 
the default 100k transactions is plenty for the particular running workload). 

There are two distinct ERRORs from the log (see attached), one appears to be 
due to broken connection and the other is 
java.nio.channels.CancelledKeyException (reported in ZOOKEEPER-1237).  The 
broken connection error happens only once when server start. 


> Sudden crash of all nodes in the cluster
> ----------------------------------------
>
>                 Key: ZOOKEEPER-2104
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: Benjamin Jaton
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying 
> "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs 
> when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN  [SyncThread:1:FileTxnLog@321] - 
> fsync-ing the write ahead log in SyncThread:1 took 11024ms which will 
> adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN  
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
> following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:23,492 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:24,060 [myid:1] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> NODE2:
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN  
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
> following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at 
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at 
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at 
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at 
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at 
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:22,801 [myid:3] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:22,886 [myid:3] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> NODE3 (leader):
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN  
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing 
> connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN  
> [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE 
> /204.53.107.249:43402 ********
> 2015-01-04 16:18:21,905 [myid:2] - WARN  
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing 
> connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN  
> [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE 
> /204.53.107.247:45953 ********
> 2015-01-04 16:18:21,918 [myid:2] - WARN  
> [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring 
> unexpected exception
> java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>         at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>         at 
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>         at 
> org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
>         at 
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> 2015-01-04 16:18:23,003 [myid:2] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:23,007 [myid:2] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running
> 2015-01-04 16:18:23,115 [myid:2] - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception 
> causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not 
> running



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to