Thanks for the report, Scott. From what I've seen so far, this seems to
be a Linux bug and not specific to Java/ZK. Here are a couple of the
more informative links I've seen:
http://hackerne.ws/item?id=4188412
http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix

Does anyone have specific insight into how this expressed itself in Java?
I've seen some references to futex being the root cause (from the Java
perspective): "It's a critical Linux bug that causes futex to timeout,
and anything that uses it to behave incorrectly."

Patrick

On Sun, Jul 1, 2012 at 2:58 PM, Scott Fines <[email protected]> wrote:
> Hello all,
>
> It appears that ZooKeeper is subject to the Linux leap second bug that has
> caused problems with Cassandra and other services. At least, I discovered 
> that after 6 hours of trying to figure out why my cluster wasn't giving me a 
> quorum.
>
> A link to the kernel bug report is at
> https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
>
> As far as what you might see in your logs, I saw a lost quorum, insanely high 
> load on my servers, and when I shut down zookeeper to bring it back up, one 
> machine would report a read timeout during leader election, then report that 
> the server told it to shut down. After that, it would forever be stuck in the 
> LOOKING phase, while another machine might be stuck in any other phase of the 
> election.
>
> The fix is simple, though. Just stop ZooKeeper, execute
>
> date -s "`date`"
>
> or restart your NTP daemon, then start ZooKeeper back up.
>
> You MUST restart ZooKeeper; otherwise, the election state doesn't recover
> (or, at least, it didn't recover for me).
>
> Hope this helps save someone else the 7 hours of agony I just went through.
>
> Scott Fines
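
For reference, the stop / reset clock / start sequence Scott describes could be wrapped up roughly like this. It is only a sketch: the ZK_CTL path to zkServer.sh is an assumption and will differ per install, the function names are made up for illustration, and DRY_RUN defaults to 1 so nothing is executed unless you opt in. Setting the clock needs root.

```shell
#!/bin/sh
# Sketch of the workaround from this thread: stop ZooKeeper, reset the
# system clock to its current value, then start ZooKeeper again.
# ASSUMPTION: ZK_CTL points at your zkServer.sh; the default is a guess.
ZK_CTL="${ZK_CTL:-/opt/zookeeper/bin/zkServer.sh}"
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper for the three steps, in the order given above.
leap_second_recover() {
    run "$ZK_CTL" stop || return 1
    # Same effect as: date -s "`date`"
    run date -s "$(date)" || return 1
    run "$ZK_CTL" start
}
```

Calling leap_second_recover as-is only prints the three commands; running it as root with DRY_RUN=0 executes them. Restarting the NTP daemon instead of the date call is the alternative Scott mentions and is not covered here.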
