Unfortunately I don't see any attached logs, which makes it difficult to offer much insight. "Not sufficient followers synced" indicates that the leader is losing followers, most likely because they are falling behind. What is your metrics tracking telling you about load on the compute, disk, memory, network, and so on? Also look at metrics at the ZK level (e.g. are ZK latencies increasing?). Check the logs to see whether you're hitting "fsync" slowness issues (it's a warning in the server logs); this is a pretty common problem. GC might also be an issue, although that's rarer these days (hard to say without knowing your use case). Again, look to your metrics collection for insight on where to start.
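As a rough way to run the fsync check above, here is a minimal Python sketch that scans server log lines for the slow-fsync warning. It assumes the warning contains the text "fsync-ing the write ahead log in <thread> took <N>ms", which is how I recall ZooKeeper wording it; adjust the pattern if your version differs. The function name `slow_fsyncs` and the `threshold_ms` parameter are made up for illustration.

```python
import re
import sys

# Assumed wording of ZooKeeper's slow-fsync warning; change this if your
# server version logs it differently.
FSYNC_RE = re.compile(r"fsync-ing the write ahead log in (\S+) took (\d+)ms")


def slow_fsyncs(lines, threshold_ms=0):
    """Return (line_number, thread_name, elapsed_ms) for each fsync warning
    whose elapsed time is at least threshold_ms (hypothetical parameter)."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = FSYNC_RE.search(line)
        if match and int(match.group(2)) >= threshold_ms:
            hits.append((number, match.group(1), int(match.group(2))))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py /path/to/zookeeper.log
    with open(sys.argv[1], errors="replace") as log_file:
        for number, thread, elapsed in slow_fsyncs(log_file):
            print(f"line {number}: {thread} fsync took {elapsed} ms")
```

If warnings cluster just before the 03:00:03 leader shutdown, disk latency is a likely place to start; if there are none, move on to GC pauses and network metrics.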
Patrick

On Wed, Oct 4, 2017 at 11:17 AM, Anand Parthasarathy <[email protected]> wrote:

> Hi,
>
> We have an issue with a 3-node zookeeper ensemble where the quorum goes
> down due to no apparent reason every once in a while. Here is what I see in
> the ZK leader:
>
> 2017-09-21 03:00:03,648 [myid:3] - INFO [QuorumPeer[myid=3]/127.0.0.1:5002:Leader@493] - Shutting down
> 2017-09-21 03:00:03,648 [myid:3] - INFO [QuorumPeer[myid=3]/127.0.0.1:5002:Leader@499] - Shutdown called
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced with sids: [ 3 ]
>     at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
>     at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:474)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799)
>
> I have attached the logs from the 3 nodes around this time. Could you pls.
> help understand what the issue could be here. The only thing I see a little
> bit ahead of this timestamp is that all of them did a PurgeTask pretty much
> at the same time.
>
> Thanks,
> Anand.
