Hi, I've logged some more information and the problem seems to be a timing issue. Long story short, the scenario is: zk1(follower) zk2(leader) zk3(follower)
step 1. stop zk1 step 2. stop zk2 step 3. start zk1 result: zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario supported by zookeeper?). When zk2 is started again instead of zk1, cluster is created. Observation: After step 3., zk3 is elected to be a leader and starts listening on port 2888 ( example port from example http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ). Listening takes around 10 seconds but during this time zk1 is not trying to connect. zk1 tries to connect just after zk3 stops listening. a part of a log is: zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts trying to connect at 04:04:19,996 as below. [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting to /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner) java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786) This is strange as I couldn't reproduce this on local VM VirtualBox but on VMware reproduction is 100%. thanks, Krzysztof ---------- Forwarded message ---------- From: Krzysztof Rybak <[email protected]> Date: Thu, Oct 20, 2016 at 4:50 PM Subject: Zookeeper exception: Timeout while waiting for epoch from quorum To: [email protected] Hi All, first mail in the group so sorry for possible inconsistency in advance. Zookeeper version is zookeeper-3.4.6. I'm facing a problem when zookeeper is reconfiguring a cluster. Initial state: machine A: zk1(follower) zk2(leader) machine B: zk3(follower) zk1 and zk2 are stopped (in that order). zk1 is started on machine B. zk1 and zk3 are not creating a cluster, status is (using srvr word) 'This ZooKeeper instance is not currently serving requests' A part of a log is: [2016-10-20 04:03:10,053] WARN Unexpected exception (org.apache.zookeeper.server.quorum.QuorumPeer) java.lang.InterruptedException: Timeout while waiting for epoch from quorum at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java: 878) at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799) [2016-10-20 04:03:10,054] INFO Shutting down (org.apache.zookeeper.server. quorum.Leader) [2016-10-20 04:03:10,054] INFO Shutdown called (org.apache.zookeeper.server. quorum.Leader) java.lang.Exception: shutdown Leader! reason: Forcing shutdown at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:805) [2016-10-20 04:03:10,054] INFO shutting down (org.apache.zookeeper.server. ZooKeeperServer) What is interesting: when zk2(previous leader) is started on machine B (instead of zk1) cluster is configured correctly. The same situation happens when all happen on the single machine. Issue is similar to this, but algorithm used by me is 3 (by default and confirmed with electionAlg=3 in .cfg files) https://issues.apache.org/jira/browse/ZOOKEEPER-2400 thanks, Krzysztof
