[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15737108#comment-15737108 ]

Haitao Yao commented on ZOOKEEPER-2104:
---------------------------------------

I encountered the same problem. Is there any feedback on whether increasing the initLimit value works? Thanks.

> Sudden crash of all nodes in the cluster
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
> In a 3-node ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> No retry seems to be happening after that.
> This is a request to understand what happened and probably to improve the logs when it does.
> See logs below:
>
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,492 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:24,060 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
> NODE2:
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,801 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,886 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
> NODE3 (leader):
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - *** GOODBYE /204.53.107.249:43402
> 2015-01-04 16:18:21,905 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - *** GOODBYE /204.53.107.247:45953
> 2015-01-04 16:18:21,918 [myid:2] - WARN
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15396234#comment-15396234 ]

Camille Fournier commented on ZOOKEEPER-2104:
---------------------------------------------

Yeah, your initLimit needs to be longer. They're not getting into quorum because it takes longer than 20s to sync. Not sure why the original node crashed, but increasing initLimit should solve this problem.
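The initLimit advice in the comment above can be put in numbers. ZooKeeper measures initLimit in ticks, so the time a follower gets to connect and sync with the leader is tickTime * initLimit. The sketch below is only an illustration: tickTime=2000 and initLimit=10 are assumed values chosen because they give the 20s mentioned in the comment (the reporter's actual zoo.cfg is not shown in this thread), and the 30s observed sync time, the headroom factor, and both function names are hypothetical.

```python
import math


def sync_timeout_ms(tick_time_ms: int, init_limit: int) -> int:
    """Time a follower is allowed for connecting and syncing to the leader."""
    return tick_time_ms * init_limit


def min_init_limit(tick_time_ms: int, observed_sync_ms: int, headroom: float = 2.0) -> int:
    """Smallest initLimit that covers the observed sync time with some headroom.

    `headroom` is an illustrative safety factor, not a ZooKeeper setting.
    """
    needed_ms = observed_sync_ms * headroom
    return max(1, math.ceil(needed_ms / tick_time_ms))


# Assumed values: tickTime=2000 ms, initLimit=10 ticks.
print(sync_timeout_ms(2000, 10))
# Hypothetical: a sync that takes 30 s, with 2x headroom.
print(min_init_limit(2000, 30_000))
```

The resulting number would go into the initLimit entry of zoo.cfg on every server in the ensemble. Whether a larger value is enough here still depends on why the sync is slow in the first place.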
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15396172#comment-15396172 ]

Camille Fournier commented on ZOOKEEPER-2104:
---------------------------------------------

It's hard to tell whether the logs were just grabbed at different times or whether this is clock drift, but I would check for clock drift. I'm also seeing this error though:

2016-07-27 11:47:05,709 [myid:2] - WARN [SyncThread:2:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:2 took ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

It's also taking over 10 seconds to read the snapshot on startup, which is not a good sign. Flavio's advice to increase the initLimit is probably good.
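To see how often and how badly fsync stalls, the warning quoted above can be pulled out of the server logs mechanically. This is a minimal sketch, not a ZooKeeper tool: the regular expression is based only on the wording of the WARN lines quoted in this thread, and the function name is made up for illustration. The sample line with a duration is the one from the original report.

```python
import re
from typing import Optional

# Matches the FileTxnLog warning text as it appears in the logs quoted in this thread.
FSYNC_RE = re.compile(r"fsync-ing the write ahead log in \S+ took (\d+)ms")


def fsync_duration_ms(log_line: str) -> Optional[int]:
    """Return the fsync duration in milliseconds, or None if the line has none."""
    match = FSYNC_RE.search(log_line)
    return int(match.group(1)) if match else None


line = (
    "2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - "
    "fsync-ing the write ahead log in SyncThread:1 took 11024ms which will "
    "adversely effect operation latency."
)
print(fsync_duration_ms(line))
```

Lines where the duration is missing, or lines that are not fsync warnings at all, yield None rather than a guessed value.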
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15396130#comment-15396130 ]

Camille Fournier commented on ZOOKEEPER-2104:
---------------------------------------------

Is it possible this is a clock drift problem? The logs you've provided end at 12:13:35 for node1, 12:18:31 for node2, and 12:14:11 for node3. I can't remember whether this degree of clock drift causes issues or not. [~fpj], do you recall?
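For reference, the gap between the three end-of-log timestamps quoted in the comment above works out as follows. This is only arithmetic on those three values. It cannot tell clock drift apart from logs that were simply collected at different moments, and the dictionary and function names are illustrative.

```python
from datetime import datetime

# Last log timestamps per node, as quoted in the comment.
last_log_times = {"node1": "12:13:35", "node2": "12:18:31", "node3": "12:14:11"}


def spread_seconds(times: dict) -> int:
    """Seconds between the earliest and the latest timestamp (same day assumed)."""
    parsed = [datetime.strptime(value, "%H:%M:%S") for value in times.values()]
    return int((max(parsed) - min(parsed)).total_seconds())


print(spread_seconds(last_log_times))  # 296, i.e. just under five minutes
```

Comparing the system clocks on the three hosts directly would be the more reliable check.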
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395387#comment-15395387 ]

Daniel Freudenberger commented on ZOOKEEPER-2104:
-------------------------------------------------

[~fpj] The referenced snapshot is the latest one, and all snapshots are around the same size. The log files all look pretty much the same to me, but maybe you will catch something that looked fine to me. You can download the log files for all nodes if you like (https://s3-eu-west-1.amazonaws.com/files.rebuy-cdn.de/logs.tgz). The transaction log is written to the same device. This is something we can't really change in our current setup.
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395346#comment-15395346 ]

Flavio Junqueira commented on ZOOKEEPER-2104:
---------------------------------------------

[~d.freudenberger] Right, 147 MB isn't large. But this indicates that the follower has timed out waiting on the leader to sync up:

{noformat}
2016-07-27 11:49:40,346 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
{noformat}

You may want to have a look at the prospective leader's logs to see if you spot anything odd. If the log files aren't too large, you may consider posting them here. Also, is the snapshot you checked the latest one? Do all snapshots have roughly that size? About devices: are you using a single device, or a dedicated device for the txn log?
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395335#comment-15395335 ]

Daniel Freudenberger commented on ZOOKEEPER-2104:
-

[~fpj] Of course I read through the comments. ZooKeeper recovered after ~15 minutes. 20 minutes later (right now) it crashed again and is flooding the log file with the following errors:

2016-07-27 11:49:39,829 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.41.199.233:60522 (no session established for client)
2016-07-27 11:49:39,864 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.41.199.201:60524
2016-07-27 11:49:39,865 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2016-07-27 11:49:39,865 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.41.199.201:60524 (no session established for client)
2016-07-27 11:49:40,095 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.41.199.217:37339
2016-07-27 11:49:40,096 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2016-07-27 11:49:40,098 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.41.199.217:37339 (no session established for client)
2016-07-27 11:49:40,245 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.41.199.63:33360
2016-07-27 11:49:40,245 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2016-07-27 11:49:40,245 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.41.199.63:33360 (no session established for client)
2016-07-27 11:49:40,317 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.41.199.111:34965
2016-07-27 11:49:40,320 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2016-07-27 11:49:40,320 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.41.199.111:34965 (no session established for client)
2016-07-27 11:49:40,346 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
    at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:272)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:72)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
2016-07-27 11:49:40,347 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
2016-07-27 11:49:40,347 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@139] - Shutting down
2016-07-27 11:49:40,347 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@419] - shutting down
2016-07-27 11:49:40,348 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@670] - LOOKING
2016-07-27 11:49:40,352 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.3300799266
The
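When a log is flooded like this, counting records per level and logging class can be quicker than reading them one by one. The following is only a sketch in Python. It assumes one record per line, as in the server's own log file, and the regular expression is a guess based on the line layout visible in this excerpt (ZooKeeper 3.4.x with its stock log4j pattern). `LINE_RE`, `summarize` and `SAMPLE_LINES` are names made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt in this thread:
# "<date> <time> [myid:N] - LEVEL [thread:Class@line] - message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[myid:(?P<myid>\d*)\] - (?P<level>[A-Z]+)\s+"
    r"\[(?P<thread>.*):(?P<cls>\w+)@(?P<line>\d+)\] - (?P<msg>.*)$"
)


def summarize(lines):
    """Count log records per (level, logging class); unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("level"), match.group("cls"))] += 1
    return counts


# Three records copied from the excerpt in this thread.
SAMPLE_LINES = [
    "2016-07-27 11:49:39,864 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.41.199.201:60524",
    "2016-07-27 11:49:39,865 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running",
    "2016-07-27 11:49:39,865 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /10.41.199.201:60524 (no session established for client)",
]

for (level, cls), n in sorted(summarize(SAMPLE_LINES).items()):
    print(level, cls, n)
```

Stack-trace lines do not match the pattern and are simply ignored, so the counts reflect log records only.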
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395321#comment-15395321 ]

Flavio Junqueira commented on ZOOKEEPER-2104:
-

[~d.freudenberger] please check the comments in this jira if you haven't done it yet. The "ZooKeeper is not running" messages are due to the server(s) being in leader election. If they can't elect a leader and make progress, then we need to determine why that's the case. To my knowledge, there is nothing to be fixed here unless you provide further evidence of a bug.

> Sudden crash of all nodes in the cluster
>
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,492 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:24,060 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> NODE2:
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,801 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,886 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> NODE3 (leader):
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - *** GOODBYE /204.53.107.249:43402
> 2015-01-04 16:18:21,905 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection
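One way to see whether a server is serving requests or still in leader election is the `srvr` four-letter command on the client port. Below is a minimal Python sketch, assuming the four-letter commands are reachable (they are enabled by default in 3.4.x; later releases may need them whitelisted). `four_letter_word` and `parse_mode` are hypothetical helper names. To my understanding, a server that is not serving answers with a single "not currently serving requests" line and no `Mode:` line, which is why the parser returns None in that case.

```python
import socket


def four_letter_word(host, port, cmd=b"srvr", timeout=5.0):
    """Send a four-letter command to the client port and return the text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line (e.g. 'leader'), or None if absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


# Example against a reachable server (host and port are placeholders):
#   print(parse_mode(four_letter_word("localhost", 2181)))
```

Polling this on every ensemble member shows which nodes report `leader` or `follower` and which are still not serving.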
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15395310#comment-15395310 ]

Daniel Freudenberger commented on ZOOKEEPER-2104:
-

We just ran into exactly the same issue: a 3-node cluster suddenly went down, with all nodes reporting "ZooKeeper is not running". Is there something I can do to get someone to look into this?

> Sudden crash of all nodes in the cluster
>
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency.
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271543#comment-15271543 ]

Flavio Junqueira commented on ZOOKEEPER-2104:
-

If increasing the initLimit value fixes it, then it is probably the case that you have large snapshots. Could you check the size of your snapshot files?

> Sudden crash of all nodes in the cluster
>
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency.
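Checking the snapshot sizes, as Flavio asks, takes only a few lines of Python. This is a sketch under assumptions: the directory used in the usage comment, `/var/lib/zookeeper/version-2`, is taken from the "Reading snapshot" log line posted elsewhere in this thread and will differ per installation (it is the configured `dataDir` plus `version-2`), and `snapshot_sizes` is a hypothetical helper name.

```python
from pathlib import Path


def snapshot_sizes(version_dir):
    """Return (file name, size in bytes) for each snapshot.* file, largest first."""
    files = [p for p in Path(version_dir).glob("snapshot.*") if p.is_file()]
    sizes = [(p.name, p.stat().st_size) for p in files]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


# Example (the path is an assumption, adjust it to your dataDir):
#   for name, size in snapshot_sizes("/var/lib/zookeeper/version-2"):
#       print(f"{name}\t{size / (1024 * 1024):.1f} MiB")
```

Snapshots of hundreds of megabytes or more would be consistent with the large-snapshot explanation above, though the size alone does not prove it.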
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15259088#comment-15259088 ]

Govindaraj commented on ZOOKEEPER-2104:
---

[~davidlao] We have the following initLimit setting:

```
# synchronization phase can take
initLimit=10
```

> Sudden crash of all nodes in the cluster
>
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency.
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258851#comment-15258851 ]

David Lao commented on ZOOKEEPER-2104:
--

Have a look at your initLimit config and ensure it's large enough to allow followers to fully sync with the leader. In my case, increasing the initLimit solved the problem.

> Sudden crash of all nodes in the cluster
>
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency.
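As a rough illustration of David's advice: initLimit is counted in ticks, so the time a follower has to connect and sync to the leader is initLimit * tickTime milliseconds. The Python sketch below computes that window from zoo.cfg text and compares it with a naive snapshot transfer time. initLimit=10 is the value reported in this thread. tickTime=2000, the 1 GiB snapshot size and the 15 MB/s throughput are illustrative assumptions, and all function names are hypothetical.

```python
def parse_zoo_cfg(text):
    """Parse key=value lines from zoo.cfg text, skipping blanks and # comments."""
    cfg = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def init_window_ms(cfg):
    """initLimit is expressed in ticks, so the window is initLimit * tickTime (ms)."""
    return int(cfg["initLimit"]) * int(cfg["tickTime"])


def transfer_time_ms(snapshot_bytes, bytes_per_second):
    """Naive lower bound: time to push the snapshot over the wire, nothing else."""
    return 1000.0 * snapshot_bytes / bytes_per_second


# tickTime is an assumed value; initLimit is the one quoted in this thread.
SAMPLE_CFG = """\
tickTime=2000
# synchronization phase can take
initLimit=10
"""

cfg = parse_zoo_cfg(SAMPLE_CFG)
window = init_window_ms(cfg)  # 10 ticks * 2000 ms = 20000 ms
needed = transfer_time_ms(1024 ** 3, 15_000_000)  # assumed 1 GiB at 15 MB/s
print(f"sync window: {window} ms, estimated transfer: {needed:.0f} ms")
if needed > window:
    print("initLimit looks too small for a snapshot of this size")
```

The transfer estimate ignores the time to serialize and load the snapshot, so a real sync takes longer than this lower bound; any initLimit chosen this way should leave generous headroom.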
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258776#comment-15258776 ]

Govindaraj commented on ZOOKEEPER-2104:
---

Hi all, I am also seeing the error below on all 3 of my nodes where ZooKeeper is installed. ZooKeeper is not coming back up. Any thoughts?

```
Apr 26 19:09:13 zookeeper: 2016-04-26 19:09:13,966 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
```

> Sudden crash of all nodes in the cluster
>
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency.
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15062554#comment-15062554 ]

David Lao commented on ZOOKEEPER-2104:
--

Unfortunately I've lost a member server and its snapshots, and upon restarting the issue is no longer reproducible. I'll keep an eye on this and provide an update as appropriate. Thanks for taking a look.

> Sudden crash of all nodes in the cluster
>
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
> Attachments: zookeeper-errors.txt, zookeeper-warns.txt
>
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs when it does.
> See logs below:
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency.
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058591#comment-15058591 ]

David Lao commented on ZOOKEEPER-2104:

In my case the logs and snapshots are on separate drives. Using the Resource Monitor app that ships with Windows, I noticed that the network traffic was going from the leader to the followers on port 2888. I was seeing ~15 MB/s of traffic between the nodes. I will try increasing the snapshot count (though I'd think the default of 100k transactions is plenty for this particular workload). There are two distinct ERRORs in the log (see attached): one appears to be due to a broken connection, and the other is java.nio.channels.CancelledKeyException (reported in ZOOKEEPER-1237). The broken-connection error happens only once, when the server starts.
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057829#comment-15057829 ]

Flavio Junqueira commented on ZOOKEEPER-2104:

The logs from the first node in the description indicate that syncing to disk is taking too long. Is the disk device shared between logs and snapshots? It is unclear from these logs why the second node abandoned the leader, but it looks like a socket timeout. [~davidlao], you say the servers were replicating massive amounts of data; how did you notice it? Also, were the servers generating too many snapshots, and is this due to a traffic spike? Would increasing snapCount help? Could you observe in the logs why the ensemble wasn't able to come back up?
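The slow-sync warning on the first node (the 11024ms fsync in the issue description) can be pulled out of a server log mechanically. What follows is only a minimal sketch, not part of ZooKeeper: the function name `slow_fsyncs` and the 1000 ms threshold are hypothetical, the pattern is taken from the one warning line quoted in this issue, and it assumes each log record sits on a single line in the real log file.

```python
import re

# Pattern taken from the FileTxnLog warning quoted in the issue description:
# "fsync-ing the write ahead log in SyncThread:1 took 11024ms which will ..."
FSYNC_RE = re.compile(r"fsync-ing the write ahead log in (\S+) took (\d+)ms")


def slow_fsyncs(lines, threshold_ms=1000):
    """Yield (thread, millis) for each fsync warning at or above threshold_ms.

    threshold_ms is an arbitrary illustrative cutoff, not a ZooKeeper setting.
    """
    for line in lines:
        match = FSYNC_RE.search(line)
        if match and int(match.group(2)) >= threshold_ms:
            yield match.group(1), int(match.group(2))


# Two records copied from the NODE1 log in the issue description.
sample = [
    "2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - "
    "fsync-ing the write ahead log in SyncThread:1 took 11024ms which will "
    "adversely effect operation latency. See the ZooKeeper troubleshooting guide",
    "2015-01-04 16:18:22,380 [myid:1] - WARN "
    "[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when "
    "following the leader",
]

print(list(slow_fsyncs(sample)))
```

Running the same function over the full log of each server, for example with `slow_fsyncs(open(path))`, would show whether slow syncs cluster around the time the followers dropped the leader.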
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058891#comment-15058891 ]

Flavio Junqueira commented on ZOOKEEPER-2104:

The errors in this case don't say much, just that the server can't read from or write to the socket. Is there any issue with your disks? Have you checked the disk traffic around the time this happened? I'm assuming this happened once, but let me know if it is reproducible.
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056947#comment-15056947 ]

David Lao commented on ZOOKEEPER-2104:

I'm seeing the same issue. Additional observations: when the ensemble gets into this state, the leader and followers appear to be replicating a massive amount of data although the Curator clients are mostly idle. The on-disk footprints of the log and snapshot are 65MB and 500MB respectively, and there is plenty of memory and disk space on the servers.
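A rough back-of-the-envelope check relates the 500MB snapshot reported here to the ~15 MB/s leader-to-follower traffic reported elsewhere on this issue. The sketch below is illustrative only and does not establish the cause of the crash. The `tickTime=2000` and `initLimit=10` values are the ones in ZooKeeper's sample configuration, assumed here because none of the reporters posted their settings. The helper names are hypothetical, and the calculation assumes a follower has to receive the whole snapshot within the `initLimit` window.

```python
def snapshot_transfer_seconds(snapshot_mb, throughput_mb_per_s):
    """Time to ship a snapshot of snapshot_mb at a steady throughput."""
    return snapshot_mb / throughput_mb_per_s


def init_window_seconds(tick_time_ms, init_limit):
    """Time a follower is allowed to connect and sync: initLimit ticks."""
    return tick_time_ms * init_limit / 1000.0


# 500 MB and ~15 MB/s come from comments on this issue.
transfer = snapshot_transfer_seconds(500, 15)
# ASSUMPTION: sample-config values, not the reporters' actual settings.
window = init_window_seconds(tick_time_ms=2000, init_limit=10)

print(f"transfer ~{transfer:.0f}s, initLimit window {window:.0f}s")
```

Under these assumed settings the transfer would take longer than the window, which is one reason raising `initLimit` comes up as something to try. With other settings or a faster link the comparison could come out differently.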
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15010403#comment-15010403 ]

Puneet Sharma commented on ZOOKEEPER-2104:

We are facing the same issue, and it took much longer to recover. So our whole purpose of using a ZooKeeper cluster for HA has been defeated. Please suggest.
[jira] [Commented] (ZOOKEEPER-2104) Sudden crash of all nodes in the cluster
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968035#comment-14968035 ]

Ray commented on ZOOKEEPER-2104:

We found the same issue, but it recovered after 10 minutes. Was this expected? Can we fix this ASAP?

> Sudden crash of all nodes in the cluster
>
> Key: ZOOKEEPER-2104
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2104
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Benjamin Jaton
>
> In a 3 nodes ensemble, suddenly all the nodes seem to fail, displaying "ZooKeeper is not running" messages.
> Not retry seems to be happening after that.
> This a request to understand what happened and probably to improve the logs when it does.
> See logs below:
>
> NODE1:
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,492 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:24,060 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
> NODE2:
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,801 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,886 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
> NODE3 (leader):
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - *** GOODBYE /204.53.107.249:43402
> 2015-01-04 16:18:21,905 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - *** GOODBYE /204.53.107.247:45953
> 2015-01-04 16:18:21,918 [myid:2] - WARN [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring unexpected exception
> java.lang.InterruptedException