Saswati created ZOOKEEPER-3909:
----------------------------------
Summary: Zookeeper Unable to Join the Cluster after it is
Restarted
Key: ZOOKEEPER-3909
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3909
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.5.7
Environment: All Environments
Reporter: Saswati
When we restart a ZooKeeper node, it does not successfully rejoin the cluster
and start serving clients. The ZooKeeper service starts successfully, but it
stays idle and returns the message: "This ZooKeeper instance is not currently
serving requests"
The ZooKeeper cluster size is 5. Whenever we need to restart the nodes, we
restart them one at a time. We restart a node in one of two ways:
# stop the service and start it back up again.
# stop the service, replace the host, and start it back up again.
In both cases we see the same issue.
-----------
When we investigated the ZooKeeper logs, we saw the following warning:
"[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN
org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
java.io.IOException: Leaders epoch, xx is less than accepted
epoch, xy"
-------------------------
But when we check, the current epoch of the leader is always the same as the
accepted epoch.
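
One way to make this check repeatable is to read the epoch files on the restarted node and compare them with the leader's epoch. The following is only a rough sketch: it assumes the node keeps its epochs as plain decimal text in files named acceptedEpoch and currentEpoch under dataDir/version-2, and the helper names (read_epoch, epoch_report) and the leader_epoch argument are made up for illustration.

```python
from pathlib import Path


def read_epoch(data_dir, name):
    """Read one epoch file (a decimal number stored as text) from version-2."""
    return int((Path(data_dir) / "version-2" / name).read_text().strip())


def epoch_report(data_dir, leader_epoch):
    """Compare the restarted node's on-disk epochs with the leader's epoch.

    leader_epoch is supplied by the caller, e.g. taken from the leader's Zxid.
    """
    accepted = read_epoch(data_dir, "acceptedEpoch")
    current = read_epoch(data_dir, "currentEpoch")
    return {
        "acceptedEpoch": accepted,
        "currentEpoch": current,
        "leaderEpoch": leader_epoch,
        # True corresponds to the condition named in the warning:
        # the leader's epoch is less than this node's accepted epoch.
        "leader_behind_accepted": leader_epoch < accepted,
    }
```

Running epoch_report against the data directory of the node that fails to join, before renaming anything, would show whether its acceptedEpoch is really ahead of the leader.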
------------------------
Also, when we get the Zxid of every quorum member, they have the same first
byte; only the last two numbers change, so I guess we can safely assume that
they are in sync.
Somehow the ZooKeeper node that we are restarting sees the epoch advancing and
shuts down as a follower.
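
To make the Zxid comparison above less of a guess: a Zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are a counter within that epoch. A minimal sketch that splits the values and checks that all members report the same epoch follows; the function names and the sample Zxid are illustrative only, not taken from our cluster.

```python
def parse_zxid(text):
    """Parse a hexadecimal Zxid string such as '0x500000012' into an int."""
    return int(text, 16)


def split_zxid(zxid):
    """Split a 64-bit Zxid into (epoch, counter)."""
    epoch = zxid >> 32            # high 32 bits
    counter = zxid & 0xFFFFFFFF   # low 32 bits
    return epoch, counter


def same_epoch(zxids):
    """Return True if every Zxid in the iterable carries the same epoch."""
    return len({split_zxid(z)[0] for z in zxids}) == 1
```

For example, split_zxid(parse_zxid("0x500000012")) gives epoch 5 and counter 18. Members whose Zxids differ only in the counter are in the same epoch.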
--------------
The only workaround we have for this issue at the moment is:
stop the ZooKeeper service --> rename the current ZooKeeper data directory
(version-2) --> start it back up again.
The node then immediately joins the cluster as a follower, as it no longer has
any record of the epoch, and starts serving clients.
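
The rename step of this workaround can be sketched as below. This is an illustration under assumptions, not a recommended procedure: the function name and the backup naming scheme are made up, and stopping and starting the service is left to whatever service manager is in use, so it appears only as comments. The directory is renamed rather than deleted so that it can be restored.

```python
import time
from pathlib import Path


def set_aside_version2(data_dir, suffix=None):
    """Rename dataDir/version-2 to a backup name and return the new path.

    Call this only while the ZooKeeper service on this node is stopped.
    """
    src = Path(data_dir) / "version-2"
    if not src.is_dir():
        raise FileNotFoundError(f"{src} does not exist")
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"version-2.bak-{suffix}")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    src.rename(dst)
    return dst


# Intended order of operations (service commands are environment-specific):
#   1. stop the ZooKeeper service on this node
#   2. set_aside_version2("<dataDir>")
#   3. start the ZooKeeper service again
```

Only version-2 is moved; the myid file sits directly in the data directory and stays in place.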
----------
--
This message was sent by Atlassian Jira
(v8.3.4#803005)