We were running version 3.2.2 for about a month and it was working well for us. Then late this past Saturday night, our cluster went pathological. One of the 3 ZK servers spewed many WARNs (see below), and the other 2 servers were almost constantly pegging the CPU. All three servers are on separate machines. From what we could tell, the machines were fine... networking fine, disk fine, etc. The ZK clients were completely unable to complete their connections to ZK.
We tried all sorts of restarts, running zkCleanup, etc. We even completely shut down our clients... and the pathology continued. Our workaround was to do an urgent upgrade to version 3.3.0. The new ZK cluster with 3.3.0 has been running well for us... so far... I realize that, sadly, this message doesn't contain nearly enough details to trace exactly what happened. I guess I'm wondering if anyone has seen this general scenario, and/or knows how to prevent? Is there anything we might be doing client side to trigger this? Our application level request frequency is maybe a few requests to Zookeeper per second, times 5 clients applications. If we detect a SESSION EXPIRED, we do a simple "create new client and use that instead". And we were seeing this happen occasionally. Many, many thanks in advance for any insights or advice. Example log output from the ZK server that spewed many WARNs: === 2010-04-24 22:31:24,551 - WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@318] - Exception when following the leader java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:66) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) at org.apache.zookeeper.server.quorum.Follower.readPacket(Follower.java:114) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:193) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:525) 2010-04-24 22:31:24,552 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@436] - shutdown called java.lang.Exception: shutdown Follower at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:436) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:529) 2010-04-24 22:31:24,552 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@514] - LOOKING ... 2010-04-24 22:31:27,257 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2010-04-24 22:31:27,257 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/127.0.0.1:2181remote=/ 127.0.0.1:44589] 2010-04-24 22:31:27,705 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:fastleaderelect...@706] - Notification: 3, 60129542145, 2, 1, LOOKING, LEADING, 3 2010-04-24 22:31:27,705 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:quorump...@523] - FOLLOWING 2010-04-24 22:31:27,705 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:zookeeperser...@160] - Created server 2010-04-24 22:31:27,706 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@147] - Following sum08/ 10.0.10.118:2888 2010-04-24 22:31:29,040 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2010-04-24 22:31:29,040 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/127.0.0.1:2181remote=/ 127.0.0.1:44591] 2010-04-24 22:31:31,069 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2010-04-24 22:31:31,069 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/127.0.0.1:2181remote=/ 127.0.0.1:44592] 2010-04-24 22:31:32,589 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running === Example log output from a server that was pegging CPU: 010-04-24 23:05:27,911 - INFO [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@254] - Sending new notification. 2010-04-24 23:05:27,912 - INFO [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@254] - Sending new notification. 2010-04-24 23:05:27,910 - WARN [Thread-15:quorumcnxmanager$sendwor...@539] - Interrupted while waiting for message on queue java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934) at java.util.concurrent.ArrayBlockingQueue.take(ArrayBlockingQueue.java:317) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:533) 2010-04-24 23:05:27,912 - INFO [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@254] - Sending new notification. 2010-04-24 23:05:27,912 - WARN [Thread-16:quorumcnxmanager$recvwor...@623] - Connection broken: java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:185) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:263) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:594) 2010-04-24 23:05:27,912 - WARN [Thread-15:quorumcnxmanager$sendwor...@554] - Send worker leaving thread ... 010-04-24 23:05:51,952 - INFO [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@254] - Sending new notification. 2010-04-24 23:05:57,148 - INFO [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@254] - Sending new notification. 2010-04-24 23:06:03,243 - INFO [FollowerHandler-/10.0.10.116:34405 :followerhand...@227] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.quorumpeer$quorumser...@364641 2010-04-24 23:06:03,244 - WARN [FollowerHandler-/10.0.10.116:34405 :followerhand...@302] - Sending snapshot last zxid of peer is 0xd0007d66d zxid of leader is 0xf00000000 2010-04-24 23:06:03,254 - ERROR [FollowerHandler-/10.0.10.116:34405 :followerhand...@415] - Unexpected exception causing shutdown while sock still open java.net.SocketException: Broken pipe at java.net.SocketOutputStream.socketWrite0(Native Method) at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) at java.io.DataOutputStream.writeLong(DataOutputStream.java:207) at org.apache.jute.BinaryOutputArchive.writeLong(BinaryOutputArchive.java:59) at org.apache.zookeeper.server.DataNode.serialize(DataNode.java:129) at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:917) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:929) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:929) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:929) at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:929) at org.apache.zookeeper.server.DataTree.serialize(DataTree.java:979) at org.apache.zookeeper.server.util.SerializeUtils.serializeSnapshot(SerializeUtils.java:104) at org.apache.zookeeper.server.ZooKeeperServer.serializeSnapshot(ZooKeeperServer.java:272) at org.apache.zookeeper.server.quorum.FollowerHandler.run(FollowerHandler.java:307) 2010-04-24 23:06:03,255 - WARN [FollowerHandler-/10.0.10.116:34405 :followerhand...@421] - ******* GOODBYE /10.0.10.116:34405 ******** 2010-04-24 23:06:03,550 - INFO [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@254] - Sending new notification.