Hello all,

It appears that ZooKeeper is subject to the Linux leap second bug that has 
caused problems with Cassandra and other services. At least, I discovered that 
after 6 hours of trying to figure out why my cluster wasn't giving me a quorum.

A link to the kernel bug report is at 
https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d

As for what you might see in your logs: I saw a lost quorum and insanely high 
load on my servers. When I shut down ZooKeeper to bring it back up, one 
machine would report a read timeout during leader election, then report that 
the server told it to shut down. After that, it would be stuck in the LOOKING 
phase forever, while another machine might be stuck in any other phase of the 
election.

The fix is simple, though. Just stop ZooKeeper, execute

date -s "`date`"

or restart your NTP daemon, then start ZooKeeper back up.

You MUST restart ZooKeeper; otherwise, the election state doesn't recover (or, 
at least, it didn't recover for me).
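
For anyone who wants to script it, here is a rough sketch of the sequence 
above (stop, reset the clock, start). It is only a sketch: the ZK_SERVER path, 
the DRY_RUN switch, and the function names are my own assumptions for 
illustration, so adjust them for your install. It needs root to set the date.

```shell
#!/bin/sh
# Sketch of the recovery sequence: stop ZooKeeper, reset the clock, start again.
# ZK_SERVER is an ASSUMED install path for zkServer.sh; change it for your setup.
ZK_SERVER="${ZK_SERVER:-/opt/zookeeper/bin/zkServer.sh}"

# Hypothetical helper: with DRY_RUN=1, print the command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the three steps from the post.
recover_from_leap_second() {
    # 1. Stop ZooKeeper first, so it does not keep running with stale election state.
    run "$ZK_SERVER" stop || return 1
    # 2. Set the clock to its current value (same effect as: date -s "`date`").
    run date -s "$(date)" || return 1
    # 3. Start ZooKeeper again so leader election begins from a clean state.
    run "$ZK_SERVER" start
}
```

Try `DRY_RUN=1 recover_from_leap_second` first to see what it would do, then 
run `recover_from_leap_second` as root on each server in the ensemble.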

Hope this helps save someone else the 7 hours of agony I just went through.

Scott Fines
