Thanks for the report, Scott. From what I've seen so far this seems to be a Linux bug and not specific to Java/ZK. Here are a couple of the more informative links I've seen:
http://hackerne.ws/item?id=4188412
http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix

Does anyone have specific insight into how this expressed itself in Java? I've seen some references to futex being the root cause (from the Java perspective): "It's a critical Linux bug that causes futex to timeout, and anything that uses it to behave incorrectly."

Patrick

On Sun, Jul 1, 2012 at 2:58 PM, Scott Fines <[email protected]> wrote:
> Hello all,
>
> It appears that ZooKeeper is subject to the linux leap seconds bug that has
> caused problems with Cassandra and other services. At least, I discovered
> that after 6 hours of trying to figure out why my cluster wasn't giving me a
> quorum.
>
> A link to the kernel bug report is at
> https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
>
> As far as what you might see in your logs, I saw a lost quorum, insanely high
> load on my servers, and when I shut down zookeeper to bring it back up, one
> machine would report a read timeout during leader election, then report that
> the server told it to shut down. After that, it would forever be stuck in the
> LOOKING phase, while another machine might be stuck in any other phase of the
> election.
>
> The fix is simple, though. Just stop ZooKeeper, execute
>
> date -s "`date`"
>
> or restart your ntp daemon, then start zookeeper back up.
>
> you MUST restart zookeeper, otherwise, the election state doesn't recover
> (or, at least, it didn't recover for me)
>
> Hope this helps save someone else the 7 hours of agony I just went through.
>
> Scott Fines
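
For anyone who wants to script the workaround quoted above (stop ZooKeeper, reset the clock, start ZooKeeper), something like the following might do. It is only a sketch: it assumes ZooKeeper is controlled through the stock zkServer.sh script on the PATH, and the ZK_CTL and DRY_RUN variables and the run helper are names made up here for illustration. By default it only prints what it would do; actually setting the clock needs root.

```shell
#!/bin/sh
# Sketch of the stop / reset clock / start sequence from the quoted message.
# Assumption (not from the thread): ZooKeeper is managed with zkServer.sh.
# Override ZK_CTL if your install uses a different control script.
ZK_CTL="${ZK_CTL:-zkServer.sh}"

# Hypothetical safety switch: 1 (the default) prints commands instead of
# running them; set DRY_RUN=0 to execute for real (requires root for date -s).
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

leap_second_workaround() {
    # ZooKeeper has to be stopped first; per the report, the election
    # state did not recover without a restart.
    run "$ZK_CTL" stop
    # Set the clock to its current value, as in the quoted fix.
    run date -s "$(date)"
    run "$ZK_CTL" start
}

leap_second_workaround
```

Run it once with the default DRY_RUN=1 to check the commands, then as root with DRY_RUN=0 on each affected server.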
