[
https://issues.apache.org/jira/browse/ZOOKEEPER-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983838#comment-15983838
]
Michael Han commented on ZOOKEEPER-2766:
----------------------------------------
What is the latency of the network? Try compare the latency with initLimit /
syncLimit see if they are at similar scale. If not (for example if network is
slow try increase initLimit/syncLimit see if that alleviate the problem).
In certain cases due to ZOOKEEPER-900 / 901 the leader election might takes a
long time to finish.
> Quorum fails with java.io.EOFException
> --------------------------------------
>
> Key: ZOOKEEPER-2766
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2766
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum
> Affects Versions: 3.5.3
> Environment: CentOS-7, Docker version 17.03.1-ce
> Reporter: Patrick Kleindienst
> Attachments: logs-node1.txt, logs-node2.txt, logs-node3.txt
>
>
> When I start a ZooKeeper ensemble comprising 3 nodes, I'm currently facing
> the following behavior:
> Two nodes (let's say node 2 and 3) out of the three start their own quorum,
> and finally one of them is elected the new leader (node 3), while the other
> one becomes the follower (node 2). Node 1 seems to be able to establish a
> connection to node 3 (elected leader) in my case, but this seems to fail for
> node 2.
> Node 1 shows the following in its logs:
> 2017-04-25 09:24:02,806 [myid:1] - INFO
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1055]
> - LOOKING
> 2017-04-25 09:24:02,808 [myid:1] - INFO
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FastLeaderElection@894]
> - New election. My id = 1, proposed zxid=0x0
> 2017-04-25 09:24:02,811 [myid:1] - INFO
> [WorkerReceiver[myid=1]:FastLeaderElection@688] - Notification: 2 (message
> format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 1 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:24:02,817 [myid:1] - WARN
> [WorkerSender[myid=1]:QuorumCnxManager@457] - Cannot open channel to 2 at
> election address /9.152.171.98:3888
> java.net.ConnectException: Connection refused (Connection refused)
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:443)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:486)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:421)
> at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:486)
> at
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:465)
> at java.lang.Thread.run(Thread.java:745)
> 2017-04-25 09:24:02,822 [myid:1] - INFO
> [WorkerSender[myid=1]:QuorumCnxManager@278] - Have smaller server identifier,
> so dropping the connection: (3, 1)
> 2017-04-25 09:24:03,025 [myid:1] - WARN
> [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@457]
> - Cannot open channel to 2 at election address /9.152.171.98:3888
> However, that's not all, since the quorum consisting of node 2 and 3 does not
> work properly. The nodes' logs tell that leader election between these two
> works fine.
> Here's what node 3 (leader) says:
> 2017-04-25 09:09:33,842 [myid:3] - INFO
> [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message
> format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config
> version)
> 2017-04-25 09:09:33,844 [myid:3] - INFO
> [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message
> format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config
> version)
> 2017-04-25 09:09:33,851 [myid:3] - INFO
> [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message
> format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config
> version)
> 2017-04-25 09:09:34,051 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
> - Unregister MBean
> [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=LeaderElection]
> 2017-04-25 09:09:34,052 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1143]
> - LEADING
> 2017-04-25 09:09:34,055 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@63]
> - TCP NoDelay set to: true
> 2017-04-25 09:09:34,055 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@83]
> - zookeeper.leader.maxConcurrentSnapshots = 10
> 2017-04-25 09:09:34,056 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@85]
> - zookeeper.leader.maxConcurrentSnapshotTimeout = 5
> And here's the output node 2 (follower) provides:
> 2017-04-25 09:09:31,875 [myid:2] - INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@688] - Notification: 2 (message
> format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
> (n.state), 2 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:32,077 [myid:2] - INFO
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
> - Unregister MBean
> [org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=repl
> ica.2,name2=LeaderElection]
> 2017-04-25 09:09:32,077 [myid:2] - INFO
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1131]
> - FOLLOWING
> 2017-04-25 09:09:32,082 [myid:2] - INFO
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@88]
> - TCP NoDelay set to: true
> So far, so good. But seconds later the connection between node 2 and 3 seems
> to get lost, causing leader node 3 to report an EOFExeption. If I understand
> the logs correctly, node 2 (follower) properly closes the connection (sending
> "Goodbye"), whilst node 3 says that the socket is still open.
> 2017-04-25 09:09:34,190 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@414]
> - LEADING - LEADER ELECTION TOOK - 138 MS
> 2017-04-25 09:09:34,197 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FileTxnSnapLog@320]
> - Snapshotting: 0x0 to /data/version-2/snapshot.0
> 2017-04-25 09:09:35,076 [myid:3] - INFO
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@382] - Follower sid: 2 :
> info : 9.152.171.98:2888:3888:participant;0.0.0.0:2181
> 2017-04-25 09:09:35,113 [myid:3] - INFO
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@683] - Synchronizing with
> Follower sid: 2 maxCommittedLog=0x0 minCommittedLog=0x0 lastProcessedZxid=0x0
> peerLastZxid=0x
> 0
> 2017-04-25 09:09:35,114 [myid:3] - INFO
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@727] - Sending DIFF
> zxid=0x0 for peer sid: 2
> 2017-04-25 09:09:35,133 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@1258]
> - Have quorum of supporters, sids: [ [2, 3],[2, 3] ]; starting up and
> setting last processed zxid: 0x100000000
> 2017-04-25 09:09:35,169 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):CommitProcessor@255]
> - Configuring CommitProcessor with 2 worker threads.
> 2017-04-25 09:09:35,179 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ContainerManager@64]
> - Using checkIntervalMs=60000 maxPerMinute=10000
> 2017-04-25 09:09:35,196 [myid:3] - WARN
> [RecvWorker:2:QuorumCnxManager$RecvWorker@919] - Connection broken for id 2,
> my id = 3, error =
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> 2017-04-25 09:09:35,196 [myid:3] - WARN
> [RecvWorker:2:QuorumCnxManager$RecvWorker@922] - Interrupting SendWorker
> 2017-04-25 09:09:35,197 [myid:3] - WARN
> [SendWorker:2:QuorumCnxManager$SendWorker@836] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:35,197 [myid:3] - WARN
> [SendWorker:2:QuorumCnxManager$SendWorker@845] - Send worker leaving thread
> id 2 my id = 3
> 2017-04-25 09:09:35,204 [myid:3] - ERROR
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@604] - Unexpected
> exception causing shutdown while sock still open
> java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
> 2017-04-25 09:09:35,204 [myid:3] - WARN
> [LearnerHandler-/9.152.171.98:51328:LearnerHandler@619] - ******* GOODBYE
> /9.152.171.98:51328 ********
> 2017-04-25 09:09:37,181 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@626]
> - Shutting down
> 2017-04-25 09:09:37,182 [myid:3] - INFO
> [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@632]
> - Shutdown called
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers
> synced, only synced with sids: [ [3] ]
> at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:632)
> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:612)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1146)
> Unfortunately, node 2 does not provide any additional information on what
> exactly is going on. After leader election, the only thing it reports is this:
> 2017-04-25 09:09:32,091 [myid:2] - INFO
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@68]
> - FOLLOWING - LEADER ELECTION TOOK - 13 MS
> 2017-04-25 09:09:32,094 [myid:2] - WARN
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273]
> - Unexpected exception, tries=0, remaining init limit=9999, connecting to
> /9.152.171.12:288
> 8
> java.net.ConnectException: Connection refused (Connection refused)
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at
> org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
> at
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
> at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
> at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> 2017-04-25 09:09:33,142 [myid:2] - INFO
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@369]
> - Getting a diff from the leader 0x0
> 2017-04-25 09:09:33,146 [myid:2] - INFO
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@516]
> - Learner received NEWLEADER message
> 2017-04-25 09:09:33,207 [myid:2] - INFO
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@499]
> - Learner received UPTODATE message
> 2017-04-25 09:09:33,220 [myid:2] - WARN
> [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1446]
> - Restarting Leader Election
> 2017-04-25 09:09:33,221 [myid:2] - INFO
> [/0.0.0.0:3888:QuorumCnxManager$Listener@665] - Leaving listener
> 2017-04-25 09:09:33,222 [myid:2] - WARN
> [SendWorker:3:QuorumCnxManager$SendWorker@836] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:33,222 [myid:2] - WARN
> [RecvWorker:3:QuorumCnxManager$RecvWorker@919] - Connection broken for id 3,
> my id = 2, error =
> java.net.SocketException: Socket closed
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.net.SocketInputStream.read(SocketInputStream.java:224)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> As far as I get that, node 2 wants to start a new leader election, but fails
> to establish a connection to the other nodes. It tries over and over, finally
> ending up in a timeout. Unfortunately, this doesn't give me any hint on what
> exactly breaks up the connection between the follower node (node 2) and the
> leader node (node 3) and why it can be re-established.
> It might also be relevant that I'm running ZooKeeper in Docker containers,
> using the host network option.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)