Hi Andor,

As this is a production server, I can’t attach the entire log file, but I’ll try to get you as much information as I can:
Nearly all of the log file is filled with connection errors from ZooKeeper clients:

> WARN NIOServerCnxn – Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> INFO NIOServerCnxn – Closed socket connection for client /<redacted> (no session established for client)

I grabbed all of the IP addresses in the log file and they’re all from clients; there is no mention of other ZK servers.

Looking at ‘Quorum’, I see a lot of:

> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO FastLeaderElection - Notification time out: 60000
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO QuorumCnxManager - Have smaller server identifier, so dropping the connection: (2, 1)
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO QuorumCnxManager - Have smaller server identifier, so dropping the connection: (3, 1)
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO QuorumCnxManager - Have smaller server identifier, so dropping the connection: (4, 1)
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181] INFO QuorumCnxManager - Have smaller server identifier, so dropping the connection: (5, 1)

Let me know if there is anything else you think I should look for. If I find anything interesting I’ll share it here.

From: Andor Molnar <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, January 25, 2019 at 10:01
To: "[email protected]" <[email protected]>
Subject: [**SPAM**] Re: [**SPAM**] RE: ZK Server does not join quorum after restart

Hi Ian,

Would you please attach logs from all participants of the ensemble or try to find an exception from when the follower is trying to join?

Regards,
Andor

On Fri, Jan 25, 2019 at 1:37 AM Ian Spence <[email protected]> wrote:

Hi Daniel,

Thanks for the quick reply. We use static IP addresses on all of the servers, so it did not change after the reboot.
Thanks,
-Ian

From: Daniel Chan <[email protected]> on behalf of Daniel Chan <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, January 24, 2019 at 16:36
To: "[email protected]" <[email protected]>
Subject: [**SPAM**] RE: ZK Server does not join quorum after restart

If its IP address got changed, then you hit a known bug, https://issues.apache.org/jira/browse/ZOOKEEPER-1506, and you need to bounce the cluster.

Thanks,
Daniel

-----Original Message-----
From: Ian Spence <[email protected]>
Sent: Thursday, January 24, 2019 2:36 PM
To: [email protected]
Subject: ZK Server does not join quorum after restart

Hello,

We have a cluster of 5 ZK servers, all running ZK 3.4.6 on Java 1.8 on CentOS 6. These are physical devices, not virtual machines.

One server required hardware maintenance and was restarted. When the ZK software was restarted, it did not rejoin the quorum as a follower.

Running the “stat” or “mntr” commands returns:

“This ZooKeeper instance is not currently serving requests”

I googled this message and came across this bug:

https://issues.apache.org/jira/browse/ZOOKEEPER-2164

Does anybody know if there is a workaround for this issue? We’ve seen this problem multiple times in the past, and our current solution is to bring down the ZK cluster (which is a huge outage-causing pain).

Thanks,
- Ian
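
For anyone comparing notes on the same symptom, here is a minimal sketch of polling every ensemble member with the “stat” command mentioned above, so you can see at a glance which servers are serving and in what mode. It is only an illustration, not a fix: the host names are hypothetical placeholders, the client port 2181 is taken from the log lines quoted in this thread, and it assumes four-letter commands are reachable from wherever the script runs.

```python
import socket

# The reply quoted in this thread for a server that has not joined the quorum.
NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def four_letter_word(host, port, command, timeout=5.0):
    """Send a four-letter command (e.g. "stat" or "mntr") and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def classify(reply):
    """Reduce a "stat" or "mntr" reply to a short state string."""
    if NOT_SERVING in reply:
        return "not serving"
    for line in reply.splitlines():
        if line.startswith("Mode:"):  # "stat" output, e.g. "Mode: follower"
            return line.split(":", 1)[1].strip()
        if line.startswith("zk_server_state"):  # "mntr" output
            parts = line.split()
            if len(parts) >= 2:
                return parts[1]
    return "unknown"


if __name__ == "__main__":
    # Hypothetical host names; replace with the five ensemble members.
    servers = ["zk1.example.com", "zk2.example.com", "zk3.example.com",
               "zk4.example.com", "zk5.example.com"]
    for host in servers:
        try:
            state = classify(four_letter_word(host, 2181, "stat"))
        except OSError as exc:  # refused, unreachable, or timed out
            state = "unreachable ({})".format(exc)
        print("{}: {}".format(host, state))
```

Running it before and after restarting a member shows whether that server ever leaves the “not serving” state, which may help narrow down when the join attempt fails.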
