Have you tried increasing syncLimit? ==================== Jordan Zimmerman
> On May 10, 2017, at 9:29 PM, Ben Sherman <[email protected]> wrote: > > I have been chasing this problem for months and I can't figure out what is > happening. > > We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as well), > and sometimes, a node gets a read timeout, drops all the connections and > tries to re-establish itself to the quorum. It can usually do this in a > few seconds, but last night it took almost 15 minutes to reconnect. > > These are 5 servers in AWS, and we've tried tuning the timeouts, but the > are exceeding any reasonable timeout and still failing. > > Any hints here? > > 2017-05-10 11:21:34,386 [myid:5] - WARN > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when > following the leader > java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) > at > org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) > at > org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) > at > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153) > at > org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85) > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937) > 2017-05-10 11:21:34,387 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called > java.lang.Exception: shutdown Follower > at > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941) > 2017-05-10 11:21:34,387 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called > 2017-05-10 11:21:34,388 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed > socket connection for client /192.168.23.109:40666 which had sessionid > 0x55bd553c076002d > [snip .. zk closes all its connections] > 2017-05-10 11:21:34,391 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed > socket connection for client /192.168.23.88:65025 which had sessionid > 0x55bd553c0760765 > 2017-05-10 11:21:34,391 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - > Shutting down > 2017-05-10 11:21:34,391 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting > down > 2017-05-10 11:21:34,391 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - > Shutting down > 2017-05-10 11:21:34,391 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting > down > 2017-05-10 11:21:34,392 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] - > shutdown of request processor complete > 2017-05-10 11:21:34,392 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] - > Shutting down > 2017-05-10 11:21:34,392 [myid:5] - INFO > [CommitProcessor:5:CommitProcessor@153] - CommitProcessor exited loop! > 2017-05-10 11:21:34,392 [myid:5] - INFO > [SyncThread:5:SyncRequestProcessor@186] - SyncRequestProcessor exited! > 2017-05-10 11:21:34,392 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING > 2017-05-10 11:21:34,393 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New > election. My id = 5, proposed zxid=0x4100013987 > 2017-05-10 11:21:34,393 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round), > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:21:34,393 [myid:5] - INFO > [FollowerRequestProcessor:5:FollowerRequestProcessor@97] - > FollowerRequestProcessor exited loop! > 2017-05-10 11:21:34,593 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - > Notification time out: 400 > 2017-05-10 11:21:34,594 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round), > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:21:34,994 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - > Notification time out: 800 > 2017-05-10 11:21:34,994 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round), > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:21:35,794 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - > Notification time out: 1600 > 2017-05-10 11:21:35,795 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round), > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:21:37,395 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - > Notification time out: 3200 > 2017-05-10 11:21:37,395 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round), > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:21:40,595 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - > Notification time out: 6400 > 2017-05-10 11:21:40,596 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round), > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:21:46,996 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - > Notification time out: 12800 > 2017-05-10 11:21:46,996 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round), > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:21:59,796 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - > Notification time out: 25600 > 2017-05-10 11:21:59,797 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round), > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > --- > MANY minutes later, we see things finally timeout, connections are > re-established, and everything comes back: > --- > 2017-05-10 11:37:02,469 [myid:5] - WARN > [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Connection broken for id > 3, my id = 5, error = > java.net.SocketException: Connection timed out (Read failed) > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) > 2017-05-10 11:37:02,470 [myid:5] - WARN > [RecvWorker:3:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker > 2017-05-10 11:37:02,469 [myid:5] - WARN > [RecvWorker:2:QuorumCnxManager$RecvWorker@1028] - Connection broken for id > 2, my id = 5, error = > java.net.SocketException: Connection timed out (Read failed) > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) > 2017-05-10 11:37:02,470 [myid:5] - WARN > [RecvWorker:2:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker > 2017-05-10 11:37:02,469 [myid:5] - WARN > [RecvWorker:4:QuorumCnxManager$RecvWorker@1028] - Connection broken for id > 4, my id = 5, error = > java.net.SocketException: Connection timed out (Read failed) > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) > 2017-05-10 11:37:02,470 [myid:5] - WARN > [RecvWorker:4:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker > 2017-05-10 11:37:02,470 [myid:5] - WARN > [SendWorker:3:QuorumCnxManager$SendWorker@944] - Interrupted while waiting > for message on queue > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932) > 2017-05-10 11:37:02,471 [myid:5] - WARN > [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving thread > 2017-05-10 11:37:02,470 [myid:5] - WARN > [SendWorker:4:QuorumCnxManager$SendWorker@944] - Interrupted while waiting > for message on queue > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932) > 2017-05-10 11:37:02,471 [myid:5] - WARN > [SendWorker:4:QuorumCnxManager$SendWorker@954] - Send worker leaving thread > 2017-05-10 11:37:02,472 [myid:5] - WARN > [SendWorker:2:QuorumCnxManager$SendWorker@944] - Interrupted while waiting > for message on queue > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932) > 2017-05-10 11:37:02,472 [myid:5] - WARN > [SendWorker:2:QuorumCnxManager$SendWorker@954] - Send worker leaving thread > 2017-05-10 11:37:03,493 [myid:5] - WARN > [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Connection broken for id > 1, my id = 5, error = > java.net.SocketException: Connection timed out (Read failed) > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:171) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013) > 2017-05-10 11:37:03,493 [myid:5] - WARN > [RecvWorker:1:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker > 2017-05-10 11:37:03,494 [myid:5] - WARN > [SendWorker:1:QuorumCnxManager$SendWorker@944] - Interrupted while waiting > for message on queue > java.lang.InterruptedException > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > at > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74) > at > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932) > 2017-05-10 11:37:03,494 [myid:5] - WARN > [SendWorker:1:QuorumCnxManager$SendWorker@954] - Send worker leaving thread > 2017-05-10 11:37:16,607 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - > Notification time out: 60000 > 2017-05-10 11:37:16,611 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round), > FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:37:16,611 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round), > FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:37:16,611 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round), > FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:37:16,611 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round), > FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:37:16,613 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round), > LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:37:16,613 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round), > LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state) > 2017-05-10 11:37:16,613 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING > 2017-05-10 11:37:16,613 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created > server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 > datadir /opt/data-zookeeper/version-2 snapdir /opt/data-zookeeper/version-2 > 2017-05-10 11:37:16,613 [myid:5] - INFO > [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING - LEADER > ELECTION TOOK - 942220 > 2017-05-10 11:37:16,615 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round), > LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state) > 2017-05-10 11:37:16,615 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round), > FOLLOWING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state) > 2017-05-10 11:37:16,615 [myid:5] - INFO > [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message > format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round), > FOLLOWING (n.state), 4 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
