[jira] [Created] (ZOOKEEPER-3909) Zookeeper Unable to Join the Cluster after it is Restarted

Saswati (Jira) Fri, 07 Aug 2020 10:02:29 -0700

Saswati created ZOOKEEPER-3909:
----------------------------------

             Summary: Zookeeper Unable to Join the Cluster after it is 
Restarted 
                 Key: ZOOKEEPER-3909
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3909
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.5.7
         Environment: All Environments 
            Reporter: Saswati



When we restart a ZooKeeper server, it does not successfully rejoin the cluster and 
start serving clients. The ZooKeeper service starts successfully, but it 
stays idle and reports the message: "This ZooKeeper instance is not currently 
serving requests"

The ZooKeeper cluster has 5 members. Whenever we need to restart the 
servers, we restart them one at a time. We restart a server in one of two ways:
 # stop the service and start it back up again.
 # stop the service, replace the host, and start it back up again.

In both cases we see the same issue.

-----------

When we investigated the ZooKeeper logs, we saw the following errors/warnings:

"[QuorumPeer[myid=1](plain=x.x.x.x:0000)(secure=disabled)] WARN  
org.apache.zookeeper.server.quorum.Learner - Exception when following the leader
java.io.IOException: Leaders epoch, xx is less than accepted 
epoch, xy"
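
For context, the comparison that appears to produce this message can be modelled roughly as below. This is a simplified Python sketch, not ZooKeeper's actual Java code: the function name check_leader_epoch is made up for illustration, and it assumes the follower compares the epoch announced by the leader against its own stored accepted epoch.

```python
def check_leader_epoch(leader_epoch: int, accepted_epoch: int) -> int:
    """Return the epoch the follower would hold after the handshake.

    Hypothetical model: raises IOError when the leader announces an epoch
    older than the one this follower has already accepted.
    """
    if leader_epoch < accepted_epoch:
        raise IOError(
            f"Leaders epoch, {leader_epoch} is less than accepted epoch, "
            f"{accepted_epoch}"
        )
    # An equal or newer leader epoch is acceptable; the follower adopts it.
    return leader_epoch
```

Under this model, the error means the restarted server has a higher accepted epoch on disk than the epoch the leader announces.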

-------------------------

However, when we check, the current epoch of the leader is always the same as the 
accepted epoch.
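
One way to compare the two values on a given host is to read them from disk. The sketch below assumes the epochs are stored as plain-text decimal numbers in files named currentEpoch and acceptedEpoch inside the version-2 directory; the helper name read_epochs and any path passed to it are placeholders, not taken from this report.

```python
from pathlib import Path


def read_epochs(version_dir: str) -> dict:
    """Read currentEpoch and acceptedEpoch from a version-2 directory.

    Assumption: each file holds a single decimal number.
    A missing file is reported as None.
    """
    result = {}
    for name in ("currentEpoch", "acceptedEpoch"):
        path = Path(version_dir) / name
        result[name] = int(path.read_text().strip()) if path.is_file() else None
    return result
```

Running this on both the leader and the restarted server would show whether the restarted server's acceptedEpoch is ahead of the leader's currentEpoch.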

------------------------

Also, when we get the zxid of every quorum member, they all have the same first 
byte; only the last two digits differ, so we assume that the members are 
in sync.
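
To make the zxid comparison concrete: a zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are a per-epoch counter. A minimal sketch for splitting it (the function name split_zxid and the sample value are illustrative only):

```python
def split_zxid(zxid: int) -> tuple:
    """Split a 64-bit zxid into (epoch, counter)."""
    epoch = zxid >> 32            # high 32 bits
    counter = zxid & 0xFFFFFFFF   # low 32 bits
    return epoch, counter


# Illustrative value only, not taken from the affected cluster.
print(split_zxid(0x500000012))  # (5, 18)
```

If every member reports the same epoch part and only the counter differs, the members agree on the epoch.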

Somehow the ZooKeeper server that we are restarting sees the epoch as having 
advanced and shuts down as a follower.

--------------

The workaround we currently use for this issue is:

stop the ZooKeeper service --> rename the current ZooKeeper data directory 
(version-2) --> start it back up again.

It then immediately joins the cluster as a follower, because it no longer has 
any record of the epoch, and starts serving clients.
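
The rename step of this workaround could be scripted roughly as follows. This is a sketch under assumptions: the helper name move_data_aside and the backup naming scheme are invented for illustration, and stopping and starting the service are left to whatever mechanism the host uses.

```python
import os
import time


def move_data_aside(data_dir: str, suffix: str = None) -> str:
    """Rename <data_dir>/version-2 to a backup name and return the new path.

    Run this only while the ZooKeeper service on this host is stopped.
    """
    src = os.path.join(data_dir, "version-2")
    if suffix is None:
        suffix = time.strftime("%Y%m%d-%H%M%S")
    dst = f"{src}.bak-{suffix}"
    if os.path.exists(dst):
        raise FileExistsError(dst)
    os.rename(src, dst)  # the old data is kept, not deleted
    return dst
```

Renaming rather than deleting keeps the old snapshot and epoch files available for later inspection.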

----------



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
