Fwd: Zookeeper exception: Timeout while waiting for epoch from quorum

Krzysztof Rybak Mon, 24 Oct 2016 03:38:08 -0700

Hi,
I've logged some more information and the problem seems to be a timing
issue.
Long story short, the scenario is:
zk1(follower)
zk2(leader)
zk3(follower)


step 1. stop zk1
step 2. stop zk2
step 3. start zk1

result:
zk1 and zk3 cannot form a cluster of 2 zk instances (is such a scenario
supported by ZooKeeper?).
When zk2 is started again instead of zk1, the cluster is formed.
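
For reference, here is a minimal sketch of the majority rule that ZooKeeper
applies to quorums by default. The function names are hypothetical and only
for illustration. With 3 configured servers, 2 running servers are a
majority, so zk1 + zk3 should in principle be enough:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that is a strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, running: int) -> bool:
    """True if the running servers are a strict majority of the ensemble."""
    return running >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Ensemble of zk1, zk2, zk3 with zk2 stopped: 2 of 3 servers are running.
    print(has_quorum(3, 2))
    # Only zk3 running (after steps 1 and 2): no majority.
    print(has_quorum(3, 1))
```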

Observation:
After step 3, zk3 is elected leader and starts listening on port 2888
(the example port from
http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ). It listens
for around 10 seconds, but during this time zk1 does not try to connect.
zk1 tries to connect just after zk3 stops listening.
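
The roughly 10 seconds of listening would be consistent with the leader's
wait for followers, which in the 3.4.x code appears to be bounded by
initLimit * tickTime. A small sketch of that arithmetic follows; the values
tickTime=2000 and initLimit=5 are an assumption taken from the sample
configuration in the getting started guide, not values confirmed for this
cluster, and the function name is hypothetical:

```python
def epoch_wait_window_ms(tick_time_ms: int, init_limit: int) -> int:
    """Upper bound, in milliseconds, on how long a new leader waits for
    followers to connect, assuming the bound is initLimit * tickTime."""
    return tick_time_ms * init_limit


if __name__ == "__main__":
    # Assumed sample values: tickTime=2000, initLimit=5.
    window = epoch_wait_window_ms(2000, 5)
    print(f"leader waits up to {window / 1000:.0f} s for followers")
```

If that assumption holds, raising initLimit (or tickTime) would widen the
window in which zk1 can still reach zk3.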

Part of the log is below.
zk3 stops listening at 04:04:17 (checked with netstat) and zk1 starts trying
to connect at 04:04:19,996:
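
For anyone who wants finer timing than netstat gives, a rough sketch of a
port watcher is below. It is only an illustration: the helper names are
hypothetical, and the address 10.54.1.53:12001 is simply the one that
appears in the log. Note that each probe opens and immediately closes a real
TCP connection, so the leader may log a warning for it.

```python
import socket
import time
from datetime import datetime


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def watch_port(host: str, port: int, duration_s: float = 30.0,
               interval_s: float = 0.2):
    """Poll host:port and return (timestamp, is_open) for each state change."""
    transitions = []
    last = None
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        state = is_listening(host, port)
        if state != last:
            transitions.append((datetime.now(), state))
            last = state
        time.sleep(interval_s)
    return transitions


if __name__ == "__main__":
    # Address taken from the log above; adjust for your own setup.
    for stamp, is_open in watch_port("10.54.1.53", 12001):
        print(stamp.isoformat(), "listening" if is_open else "not listening")
```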

[2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting to /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)

This is strange: I couldn't reproduce it on a local VirtualBox VM, but on
VMware it reproduces 100% of the time.

thanks,
Krzysztof

---------- Forwarded message ----------
From: Krzysztof Rybak <[email protected]>
Date: Thu, Oct 20, 2016 at 4:50 PM
Subject: Zookeeper exception: Timeout while waiting for epoch from quorum
To: [email protected]


Hi All,
this is my first mail to the group, so apologies in advance for any
inconsistencies.
The ZooKeeper version is 3.4.6.

I'm facing a problem when ZooKeeper is reconfiguring a cluster.

Initial state:
machine A:
zk1(follower)
zk2(leader)
machine B:
zk3(follower)

zk1 and zk2 are stopped (in that order).
zk1 is started on machine B.
zk1 and zk3 do not form a cluster; the status (using the srvr four-letter
word) is
'This ZooKeeper instance is not currently serving requests'

A part of a log is:
[2016-10-20 04:03:10,053] WARN Unexpected exception (org.apache.zookeeper.server.quorum.QuorumPeer)
java.lang.InterruptedException: Timeout while waiting for epoch from quorum
at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:878)
at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799)
[2016-10-20 04:03:10,054] INFO Shutting down (org.apache.zookeeper.server.quorum.Leader)
[2016-10-20 04:03:10,054] INFO Shutdown called (org.apache.zookeeper.server.quorum.Leader)
java.lang.Exception: shutdown Leader! reason: Forcing shutdown
at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:805)
[2016-10-20 04:03:10,054] INFO shutting down (org.apache.zookeeper.server.ZooKeeperServer)

What is interesting: when zk2 (the previous leader) is started on machine B
(instead of zk1), the cluster is configured correctly.
The same thing happens when everything runs on a single machine.

The issue is similar to the one below, but the election algorithm I use is 3
(the default, and confirmed with electionAlg=3 in the .cfg files):
https://issues.apache.org/jira/browse/ZOOKEEPER-2400

thanks,
Krzysztof
