Re: Logger hierarchies in ZK?

2010-07-20 Thread Travis Crawford
On Tue, Jul 20, 2010 at 6:07 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 It is pretty easy to keep configuration files in general in ZK and reload
 them on change.  Very handy some days!

We recently open-sourced a tool to handle stuff like config reloads,
triggering actions, etc.:

http://github.com/twitter/twitcher

Short version is a single daemon sets your watches and triggers local
actions when stuff happens. If your app doesn't speak ZK this might be
a good solution.
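
For apps that do speak ZK, a minimal sketch of the watch-and-reload pattern
with zkpython might look like the following (the znode path and
reload_config() are made up for illustration, and error handling is omitted):

import threading
import zookeeper

CONFIG_ZNODE = '/config/myapp'      # hypothetical path

connected = threading.Event()

def conn_watcher(handle, event, state, path):
    if state == zookeeper.CONNECTED_STATE:
        connected.set()

def reload_config(data):
    # Stand-in for whatever the application does with new config data.
    print 'reloading config: %r' % data

def config_watcher(handle, event, state, path):
    # Watches are one-shot: re-read the znode, which also re-registers the watch.
    data, stat = zookeeper.get(handle, CONFIG_ZNODE, config_watcher)
    reload_config(data)

zh = zookeeper.init('localhost:2181', conn_watcher)
connected.wait()
data, stat = zookeeper.get(zh, CONFIG_ZNODE, config_watcher)
reload_config(data)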

--travis




 On Tue, Jul 20, 2010 at 5:38 PM, ewhau...@gmail.com wrote:

  Has anyone experimented with storing logger hierarchies in ZK? I'm looking
  for a mechanism to dynamically change logger settings across a cluster of
  daemons. An app that connects to all servers via JMX would solve the
  problem; but we have a number of subsystems that do not run on the JVM so
  JMX is not a complete solution. Thanks.
 


Re: zookeeper crash

2010-07-06 Thread Travis Crawford
Hey all -

I believe we just suffered an outage from this issue. Short version is
that while restarting quorum members with the GC flags recommended in the
Troubleshooting wiki page, a follower logged messages matching the
following jiras:

2010-07-06 23:14:01,438 - FATAL
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@71] - Leader epoch 20 is
less than our epoch 21
2010-07-06 23:14:01,438 - WARN
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@82] - Exception when
following the leader
java.io.IOException: Error: Epoch of leader is lower

https://issues.apache.org/jira/browse/ZOOKEEPER-335
https://issues.apache.org/jira/browse/ZOOKEEPER-790

Reading through the jiras, it's unclear to me whether the issue is well
understood at this point (since there's a patch available) or still being
investigated.

If it's still being investigated, let me know and I can attach the
relevant log lines to the appropriate jira.

Or if the patch appears good I can make a new release and help test.
Let me know :)

--travis





On Wed, Jun 16, 2010 at 3:25 PM, Flavio Junqueira f...@yahoo-inc.com wrote:
 I would recommend opening a separate jira issue. I'm not convinced the
 issues are the same, so I'd rather keep them separate and link the issues if
 it is the case.

 -Flavio

 On Jun 17, 2010, at 12:16 AM, Patrick Hunt wrote:

 We are unable to reproduce this issue. If you can provide the server
 logs (all servers) and attach them to the jira it would be very helpful.
 Some detail on the approx time of the issue so we can correlate to the
 logs would help too (summary of what you did/do to cause it, etc...
 anything that might help us nail this one down).

 https://issues.apache.org/jira/browse/ZOOKEEPER-335

 Some detail on ZK version, OS, Java version, HW info, etc... would also
 be of use to us.

 Patrick

 On 06/16/2010 02:49 PM, Vishal K wrote:

 Hi,

 We are running into this bug very often (almost 60-75% hit rate) while
 testing our newly developed application over ZK. This is almost a blocker
 for us. Will the fix be simplified if backward compatibility was not an
 issue?

 Considering that this bug is rarely reported, I am wondering why we are
 running into this problem so often. Also, on a side note, I am curious
 why
 the systest that comes with ZooKeeper did not detect this bug. Can anyone
 please give an overview of the problem?

 Thanks.
 -Vishal


 On Wed, Jun 2, 2010 at 8:17 PM, Charity Majors char...@shopkick.com
  wrote:

 Sure thing.

 We got paged this morning because backend services were not able to write
 to the database.  Each server discovers the DB master using zookeeper, so
 when zookeeper goes down, they assume they no longer know who the DB
 master is and stop working.

 When we realized there were no problems with the database, we logged in to
 the zookeeper nodes.  We weren't able to connect to zookeeper using
 zkCli.sh from any of the three nodes, so we decided to restart them all,
 starting with node one.  However, after restarting node one, the cluster
 started responding normally again.

 (The timestamps on the zookeeper processes on nodes two and three *are*
 dated today, but none of us restarted them.  We checked shell histories
 and sudo logs, and they seem to back us up.)

 We tried getting node one to come back up and join the cluster, but that's
 when we realized we weren't getting any logs, because log4j.properties was
 in the wrong location.  Sorry -- I REALLY wish I had those logs for you.
 We put log4j back in place, and that's when we saw the spew I pasted in my
 first message.

 I'll tack this on to ZK-335.



 On Jun 2, 2010, at 4:17 PM, Benjamin Reed wrote:

 charity, do you mind going through your scenario again to give a
 timeline for the failure? i'm a bit confused as to what happened.

 ben

 On 06/02/2010 01:32 PM, Charity Majors wrote:

 Thanks.  That worked for me.  I'm a little confused about why it threw
 the entire cluster into an unusable state, though.

 I said before that we restarted all three nodes, but tracing back, we
 actually didn't.  The zookeeper cluster was refusing all connections
 until we restarted node one.  But once node one had been dropped from the
 cluster, the other two nodes formed a quorum and started responding to
 queries on their own.

 Is that expected as well?  I didn't see it in ZOOKEEPER-335, so I thought
 I'd mention it.



 On Jun 2, 2010, at 11:49 AM, Patrick Hunt wrote:


 Hi Charity, unfortunately this is a known issue, not specific to 3.3,
 that we are working to address. See this thread for some background:

 http://zookeeper-user.578899.n2.nabble.com/odd-error-message-td4933761.html

 I've raised the JIRA level to blocker to ensure we address this asap.

 As Ted suggested you can remove the datadir -- only on the affected
 server -- and then restart it. That should resolve the issue (the server
 will d/l a snapshot of the current db from the leader).

 Patrick

 On 06/02/2010 11:11 AM, Charity Majors wrote:

 I upgraded my zookeeper cluster 

Zookeeper outage recap questions

2010-07-01 Thread Travis Crawford
Hey zookeepers -

We just experienced a total zookeeper outage, and here's a quick
post-mortem of the issue, and some questions about preventing it going
forward. Quick overview of the setup:

- RHEL5 2.6.18 kernel
- Zookeeper 3.3.0
- ulimit raised to 65k files
- 3 cluster members
- 4-5k connections in steady-state
- Primarily C and python clients, plus some java

In chronological order, the issue manifested itself as an alert about RW
tests failing. Logs were full of "too many files" errors, and the output
of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was at
100%. Application logs showed lots of connection timeouts. This
suggests an event happened that caused applications to dogpile on
Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles
to run out -- basically game over.

I looked through lots of logs (clients+servers) and did not see a
clear indication of what happened. Graphs show a sudden decrease in
network traffic when the outage began; zookeeper goes CPU bound and
runs out of file descriptors.

Clients are primarily a couple thousand C clients using default
connection parameters, and a couple thousand python clients using
default connection parameters.

Digging through Jira we see two issues that probably contributed to this outage:

https://issues.apache.org/jira/browse/ZOOKEEPER-662
https://issues.apache.org/jira/browse/ZOOKEEPER-517

Both are tagged for the 3.4.0 release. Anyone know if that's still the
case, and when 3.4.0 is roughly scheduled to ship?

Thanks!
Travis


Re: Zookeeper outage recap questions

2010-07-01 Thread Travis Crawford
I've moved this thread to:

https://issues.apache.org/jira/browse/ZOOKEEPER-801

--travis


On Thu, Jul 1, 2010 at 12:37 AM, Patrick Hunt ph...@apache.org wrote:
 Hi Travis, as Flavio suggested would be great to get the logs. A few
 questions:

 1) how did you eventually recover, restart the zk servers?

 2) was the cluster losing quorum during this time? leader re-election?

 3) Any chance this could have been initially triggered by a long GC pause on
 one of the servers? (is gc logging turned on, any sort of heap monitoring?)
 Has the GC been tuned on the servers, for example CMS and incremental?

 4) what are the clients using for timeout on the sessions?

 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a
 few weeks to fix a couple critical issues (which don't seem related to what
 you saw). If we can identify the problem here we should be able to include
 it in any fix release we do.

 fixing something like 517 might help, but it's not clear how we got to this
 state in the first place. fixing 517 might not have any effect if the root
 cause is not addressed. 662 has only ever been reported once afaik, and we
 weren't able to identify the root cause for that one.

 One thing we might also consider is modifying the zk client lib to backoff
 connection attempts if they keep failing (timing out say). Today the clients
 are pretty aggressive on reconnection attempts. Having some sort of backoff
 (exponential?) would provide more breathing room to the server in this
 situation.
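
For illustration, here is a rough application-side sketch of backoff with
jitter around session establishment (zkpython; this is not the proposed
client-library change, and the 10-second wait is an arbitrary choice):

import random
import threading
import time
import zookeeper

def connect_with_backoff(servers, max_delay=60.0):
    """Keep retrying session establishment, backing off exponentially with jitter."""
    delay = 0.5
    while True:
        connected = threading.Event()
        def watcher(handle, event, state, path):
            if state == zookeeper.CONNECTED_STATE:
                connected.set()
        zh = zookeeper.init(servers, watcher)
        connected.wait(10)                  # give the session a few seconds to come up
        if connected.is_set():
            return zh
        zookeeper.close(zh)
        # Sleep with jitter so a fleet of clients doesn't dogpile the ensemble.
        time.sleep(delay + random.uniform(0, delay))
        delay = min(delay * 2, max_delay)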

 Patrick

 On 06/30/2010 11:13 PM, Travis Crawford wrote:

 Hey zookeepers -

 We just experienced a total zookeeper outage, and here's a quick
 post-mortem of the issue, and some questions about preventing it going
 forward. Quick overview of the setup:

 - RHEL5 2.6.18 kernel
 - Zookeeper 3.3.0
 - ulimit raised to 65k files
 - 3 cluster members
 - 4-5k connections in steady-state
 - Primarily C and python clients, plus some java

 In chronological order, the issue manifested itself as an alert about RW
 tests failing. Logs were full of "too many files" errors, and the output
 of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was at
 100%. Application logs showed lots of connection timeouts. This
 suggests an event happened that caused applications to dogpile on
 Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles
 to run out -- basically game over.

 I looked through lots of logs (clients+servers) and did not see a
 clear indication of what happened. Graphs show a sudden decrease in
 network traffic when the outage began; zookeeper goes CPU bound and
 runs out of file descriptors.

 Clients are primarily a couple thousand C clients using default
 connection parameters, and a couple thousand python clients using
 default connection parameters.

 Digging through Jira we see two issues that probably contributed to this
 outage:

     https://issues.apache.org/jira/browse/ZOOKEEPER-662
     https://issues.apache.org/jira/browse/ZOOKEEPER-517

 Both are tagged for the 3.4.0 release. Anyone know if that's still the
 case, and when 3.4.0 is roughly scheduled to ship?

 Thanks!
 Travis



Re: ZKClient

2010-05-04 Thread Travis Crawford
On Tue, May 4, 2010 at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Travis,

 Attachments are stripped from this mailing list.  Can you file a JIRA and
 put your attachment on that instead?

 Here is a link to get you started:
 https://issues.apache.org/jira/browse/ZOOKEEPER

Whoops. Filed:

https://issues.apache.org/jira/browse/ZOOKEEPER-765

--travis



 On Tue, May 4, 2010 at 3:43 PM, Travis Crawford 
  traviscrawf...@gmail.com wrote:

 Attached is a skeleton application I extracted from a script I use --
 perhaps we could add this as a recipe? If there are issues I'm more
 than happy to fix them, or add more comments, whatever. It took a
 while to figure this out and I'd love to save others that time in the
 future.

 --travis


 On Tue, May 4, 2010 at 3:16 PM, Mahadev Konar maha...@yahoo-inc.com
 wrote:
  Hi Adam,
   I don't think zk is very, very hard to get right. There are examples in
   src/recipes which implement locks/queues/others. There is ZOOKEEPER-22
   to make it even easier for applications to use.
  
   Regarding re-registration of watches, you can definitely write code and
   submit it as part of a well-documented contrib module which lays out the
   assumptions/design of it. It could very well be useful for others. It's
   just that folks haven't had much time to focus on these areas yet.
 
  Thanks
  mahadev
 
 
  On 5/4/10 2:58 PM, Adam Rosien a...@rosien.net wrote:
 
   I use zkclient in my work at kaChing and I have mixed feelings about
   it. On one hand it makes easy things easy, which is great, but on the
   other hand I have very little idea what assumptions it makes under the
   hood. I also dislike some of the design choices, such as unchecked
   exceptions, but that's neither here nor there. It would take some
   extensive documentation work by the authors to really enumerate the
   model and assumptions, but the project doesn't seem to be active
   (either because it's adequate for its current users or because it has
   simply gone quiet). I'm not sure I could derive the assumptions myself.
 
  I'm a bit frustrated that zk is very, very hard to really get right.
  At a project level, can't we create structures to avoid most of these
  errors? Can there be a standard model with detailed assumptions and
  implementations of all the recipes? How can we start this? Is there
  something that makes this too hard?
 
  I feel like a recipe page is a big fail; wouldn't an example app that
  uses locks and barriers be that much more compelling?
 
   For the common FAQ items like "you need to re-register the watch",
   can't we just create code that implements this pattern? My goal is to
   live up to the motto: a good API is impossible to use incorrectly.
 
  .. Adam
 
  On Tue, May 4, 2010 at 2:21 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
   In general, writing this sort of layer on top of ZK is very, very hard
   to get really right for general use. In a simple use-case you can
   probably nail it, but distributed systems are a Zoo, to coin a phrase.
   The problem is that you are fundamentally changing the metaphors in use,
   so assumptions can come unglued or be introduced pretty easily.
  
   One example of this is the fact that ZK watches *don't* fire for every
   change, but when you write listener-oriented code you kind of expect
   that they will. That makes it really, really easy to plant that
   assumption in the head of a programmer using an event-listener library
   on top of ZK. Another example is that the atomic get-content/set-watch
   call in ZK is easy to violate in an event-driven architecture, because
   the thread that watches ZK probably resets the watch. If you assume that
   the listener will read the data, then you have introduced a timing
   mismatch between the read of the data and the resetting of the watch.
   That might be OK or it might not be. The point is that these changes are
   subtle and tricky to get exactly right.
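
To make that concrete, here's a small zkpython-style sketch of the safer
pattern being described (read and re-register in one place; the znode path
and handle_children() are made up, and this is not how ZkClient itself works):

import zookeeper

ZPARENT = '/some/group'     # hypothetical znode

def handle_children(children):
    pass                    # application-specific

def child_watcher(handle, event, state, path):
    # One watch event may cover several changes, so read the *current*
    # children rather than assuming one event per change. Reading here also
    # keeps the read paired with the watch re-registration, instead of
    # handing "something changed" to another thread that reads later.
    children = zookeeper.get_children(handle, ZPARENT, child_watcher)
    handle_children(children)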
 
  On Tue, May 4, 2010 at 1:48 PM, Jonathan Holloway 
  jonathan.hollo...@gmail.com wrote:
 
  Is there any reason why this isn't part of the Zookeeper trunk
 already?
 
 
 
 




Re: Misbehaving zk servers

2010-04-29 Thread Travis Crawford
On Thu, Apr 29, 2010 at 9:49 AM, Patrick Hunt ph...@apache.org wrote:
 Is there any good (simple/fast/bulletproof) way to monitor the FD use inside
 the jvm? If so we could stop accepting new client connections once we get
 close to the OS-imposed limit... The test would have to be a bulletproof one
 though - we wouldn't want to end up in some worse situation (where we refuse
 connections because we mistakenly believe that the limit has been reached).

 Might be good to open a JIRA for this and add some tests. In particular we
 should verify the server handles this as gracefully as it can when the limit
 has been reached.

Poking around with jconsole I found two stats that already measure FDs:

- java.lang.OperatingSystem.MaxFileDescriptorCount
- java.lang.OperatingSystem.OpenFileDescriptorCount

They're described (rather tersely) at:

http://java.sun.com/javase/6/docs/jre/api/management/extension/com/sun/management/UnixOperatingSystemMXBean.html

So it sounds like the feature request would be: stop accepting new
client connections if OpenFileDescriptorCount > 95% of
MaxFileDescriptorCount, and only start accepting new connections again
when OpenFileDescriptorCount < 90% of MaxFileDescriptorCount. Basically
the high/low watermark thing.
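
If it helps, a toy sketch of just that hysteresis logic (Python; the
thresholds are the ones above, and hooking it into the server's accept loop
is hand-waved):

class FdWatermark(object):
    """High/low watermark gate: stop accepting above the high mark,
    resume once usage drops back below the low mark."""

    def __init__(self, max_fds, high=0.95, low=0.90):
        self.max_fds = max_fds
        self.high = high
        self.low = low
        self.accepting = True

    def update(self, open_fds):
        if self.accepting and open_fds > self.high * self.max_fds:
            self.accepting = False      # too close to the limit: stop accepting
        elif not self.accepting and open_fds < self.low * self.max_fds:
            self.accepting = True       # enough headroom again
        return self.accepting

# Driven by the OperatingSystem MBean counts above, e.g.:
#   gate = FdWatermark(max_fds=65536)
#   accepting = gate.update(open_fds)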

Thoughts?

--travis





 Patrick

 On 04/29/2010 09:34 AM, Mahadev Konar wrote:

 Hi Travis,

  How many clients did you have connected to this server? Usually the
 default is 8K file descriptors. Did you have more clients than that?

 Also, if clients fail to attach to a server, they will run off to another
 server. We do not do any blacklisting because we expect the server to heal,
 and if it does not, it usually shuts itself down.

 Thanks
 mahadev


 On 4/29/10 12:08 AM, Travis Crawford traviscrawf...@gmail.com wrote:

 Hey zookeeper gurus -

 We recently had a zookeeper outage after one ZK server was started with
 a low file descriptor limit when upgrading to 3.3.0. Several days later
 the outage occurred when that node reached the limit and clients
 started having major issues.

 Are there any circumstances when a ZK server will get blacklisted from
 the ensemble? Something similar to how tasktrackers are blacklisted
 when too many tasks fail.

 Thanks!
 Travis




Re: python client structure

2010-04-21 Thread Travis Crawford
On Wed, Apr 21, 2010 at 12:26 AM, Henry Robinson he...@cloudera.com wrote:

 Hi Travis -

 Great to see zkpython getting used. I'm glad you're finding the problems
 with the documentation - please do file JIRAs with anything you'd like to
 see improved (and I know there's a lot to improve with zkpython).


Yeah I was pretty excited to see the python bindings. Thanks!

The best place for help I was able to find is ``help(zookeeper)`` --
is there somewhere else I should be looking instead? It looks like you
wrapped the C client, so I've also been looking at zookeeper.h and
trying to infer what's going on.


 You are using the asynchronous form of get_children. This means that
 ZooKeeper can send you two notifications. The first is called when the
 get_children call completes. The second is the watcher and is called when
 the children of the watched node change. You can omit the watcher if you
 don't need it, or alternatively use the synchronous form which is written
 get_children. This call doesn't return until the operation is complete, so
 you don't need to worry about a callback.


Ok sounds like that general structure will work then. Thanks for verifying.


 The first argument to any watcher or callback is the handle of the client
 that placed the callback. Not the return code! We pass that in so that it's
 easy to make further ZK calls because the handle is readily available. The
 second argument for a callback is the return code, and that  can be mapped
 to a string via zerror(rc) if needed (but as you have found, there are
 numeric return code constants in the module that have readable symbolic
 names).


Aah first argument is the handle! That makes sense.
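
So based on that, the completion callback from my sample below would
presumably look something like this (untested sketch, slotting into the
ZKTest class further down):

  def handler(self, handle, rc, children):
    """aget_children() completion callback: (handle, return code, children)."""
    if rc == zookeeper.OK:
      logger.debug('Children of %s: %s', self.zparent, children)
    else:
      # zerror() maps the numeric return code to a readable string.
      logger.debug('get_children failed: %s -- retrying', zookeeper.zerror(rc))
      zookeeper.aget_children(handle, self.zparent, self.watcher, self.handler)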


 Does this help at all? Let me know if you have any follow on questions.

Very much so! If I clean up the sample code below, would you want to put
it on a wiki or check it in as an example? It would have been nice if
someone had already figured this out when I started messing with things :)

--travis



 cheers,
 Henry

 On 20 April 2010 23:33, Travis Crawford traviscrawf...@gmail.com wrote:

  Hey zookeeper gurus -
 
   I'm getting started with Zookeeper and the python client and am curious if
   I'm structuring watches correctly. I'd like to watch a znode and do stuff
  when its children change. Something doesn't feel right about having two
  methods: one to handle the actual get children call, and one to handle the
  watch.
 
  Does this seem like the right direction? If not, any suggestions on how to
  better structure things?
 
 
  #!/usr/bin/python
 
  import signal
  import threading
  import zookeeper
 
  import logging
  logger = logging.getLogger()
 
  from optparse import OptionParser
  options = None
  args = None
 
 
  class ZKTest(threading.Thread):
   zparent = '/home/travis/zktest'
 
   def __init__(self):
     threading.Thread.__init__(self)
     if options.verbose:
       zookeeper.set_debug_level(zookeeper.LOG_LEVEL_DEBUG)
     self.zh = zookeeper.init(options.servers)
     zookeeper.aget_children(self.zh, self.zparent, self.watcher,
  self.handler)
 
   def __del__(self):
     zookeeper.close(self.zh)
 
   def handler(self, rc, rc1, children):
      """Handle zookeeper.aget_children() responses.
 
     Args:
       Arguments are not documented well and I'm not entirely sure what to
       call these. ``rc`` appears to be the response code, such as OK.
       However, the only possible mapping of 0 is OK, so in successful cases
       there appear to be two response codes. The example with no children
       returned ``rc1`` of -7 which maps to OPERATIONTIMEOUT so that appears
        to be an error code, but it's not clear what was OK in that case.
 
       If anyone figures this out I would love to know.
 
     Example args:
       'args': (0, 0, ['a', 'b'])
       'args': (0, -7, [])
 
     Does not provide a return value.
      """
     logger.debug('Processing response: (%d, %d, %s)' % (rc, rc1, children))
     if (zookeeper.OK == rc and zookeeper.OK == rc1):
       logger.debug('Do the actual work here.')
     else:
       logger.debug('Error getting children! Retrying.')
       zookeeper.aget_children(self.zh, self.zparent, self.watcher,
  self.handler)
 
   def watcher(self, rc, event, state, path):
      """Handle zookeeper.aget_children() watches.
 
     This code is called when an child znode changes and triggers a child
     watch. It is not called to handle the aget_children call itself.
 
     Numeric arguments map to constants. See ``DATA`` in ``help(zookeeper)``
     for more information.
 
     Args:
       rc Return code.
       event Event that caused the watch (often called ``type`` elsewhere).
        state Connection state.
       path Znode that triggered this watch.
 
     Does not provide a return value.
      """
     logger.debug('Child watch: (%d, %d, %d, %s)' % (rc, event, state, path))
     zookeeper.aget_children(self.zh, self.zparent, self.watcher,
  self.handler)
 
   def run(self):
     while True:
       pass
 
 
  def

Re: Recovery issue - how to debug?

2010-04-19 Thread Travis Crawford
On Mon, Apr 19, 2010 at 2:15 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Can you attach the screen shot to the JIRA issue?  The mailing list strips
 these things.

Oops. Updated jira:

https://issues.apache.org/jira/browse/ZOOKEEPER-744

--travis



 On Mon, Apr 19, 2010 at 1:18 PM, Travis Crawford
  traviscrawf...@gmail.com wrote:

 Filed:

    https://issues.apache.org/jira/browse/ZOOKEEPER-744

 Attached is a screenshot of some JMX output in Ganglia - it's currently
 implemented using a -javaagent tool I happened to find. Having a
 simple non-java way to fetch monitoring stats and publish to an
 external monitoring system would be awesome, and probably reusable by
 others.




Re: monitoring zookeeper

2010-04-14 Thread Travis Crawford
Hey Kishore -

Thanks for the info. I found an interesting library called jmxetric
(http://code.google.com/p/jmxetric) that reads MBeans and publishes their
contents to Ganglia, and it's working pretty well. A simplified config looks
like:

<jmxetric-config>
  <jvm process="Zookeeper"/>
  <sample delay="60">
    <mbean name="org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=Leader"
           pname="ZK">
      <attribute name="AvgRequestLatency" type="double"/>
      <attribute name="MaxRequestLatency" type="double"/>
      <attribute name="MinRequestLatency" type="double"/>
      <attribute name="OutstandingRequests" type="double"/>
      <attribute name="PacketsReceived" type="double"/>
      <attribute name="PacketsSent" type="double"/>
    </mbean>
  </sample>
</jmxetric-config>

It doesn't solve the nested property issue, unfortunately, so I may have to
flatten some statistics as you have. I'm interested in checking out your
code if you don't mind.


At a higher level, I'm interested in setting up the sort of monitoring one
would expect of a critical datacenter service. To start with, I'd like to
collect data necessary to:

- page when there's no leader
- page when only the minimum number of replicas needed for quorum are present
- email when replicas are missing, but the ensemble is still above the quorum minimum.

For example, send an email when 1/5 are down, and page when 2/5 are down.
Also page if there's no leader for some other reason. The operational
metrics like latencies, connections, requests would be useful in
troubleshooting issues as well as capacity planning.
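
One low-tech way to get most of that without JMX is the 'stat' four-letter
word; a rough sketch of a check along those lines (the host list is made up,
and real alerting would replace the prints):

import socket

SERVERS = [('zk1', 2181), ('zk2', 2181), ('zk3', 2181)]   # hypothetical hosts

def four_letter(host, port, cmd):
    """Send a ZooKeeper four-letter word and return the raw response."""
    s = socket.create_connection((host, port), timeout=5)
    try:
        s.sendall(cmd)
        return s.makefile().read()
    finally:
        s.close()

up, leaders = 0, 0
for host, port in SERVERS:
    try:
        stat = four_letter(host, port, 'stat')
    except (socket.error, socket.timeout):
        continue
    up += 1
    if 'Mode: leader' in stat:
        leaders += 1

quorum = len(SERVERS) // 2 + 1
if leaders == 0 or up <= quorum:
    print 'PAGE'     # no leader, or down to the bare quorum minimum
elif up < len(SERVERS):
    print 'EMAIL'    # a replica is missing but quorum still has headroom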

--travis




On Wed, Apr 14, 2010 at 4:50 PM, kishore g g.kish...@gmail.com wrote:

 Hi Travis,

  We do monitor zookeeper using JMX. We have simple code which does the
  following:

    - parse the JMX output and convert it into key-value format; the
      nested properties are flattened.
    - emit the key values using LWES [http://www.lwes.org/] APIs at a
      regular (configurable) interval.
    - the keys to be emitted can be configured via a config file.

  We have our own internal reporting framework which displays these metrics.
  In order to differentiate between leader and follower we use separate keys:

 ReplicatedServer_idXXX_replica.XXX_Follower.AvgRequestLatency=rsf_mrl
 ReplicatedServer_idXXX_replica.XXX_Leader.AvgRequestLatency=rsl_mrl

 If the server is leader then rsf_mrl will be empty and vice versa. I can
 provide the code to do this and you can probably change it to meet your
 needs and enhance it to work for Ganglia. Let me know if this helps you.

 thanks,
 Kishore G

 On Wed, Apr 14, 2010 at 11:12 AM, Travis Crawford
  traviscrawf...@gmail.com wrote:

  Hey zookeeper gurus -
 
   Are there any recommended ways for one to monitor zookeeper ensembles? I'm
   familiar with the four-letter words and that stats are published via JMX --
   I'm more interested in what people are doing with those stats.
 
   I'd like to publish the JMX stats to Ganglia, and this works well for the
   built-in stats. However, the zookeeper-specific names appear to be dynamic,
   which causes issues when deciding what to publish. For example, the current
   mode (leader/follower) appears to only be accessible from the bean names,
   instead of looking at, say, a mode stat.
 
 
 
 org.apache.ZooKeeperService:name0=ReplicatedServer_id1,name1=replica.1,name2=Follower
 
 
 org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader
 
 
  The only way I've found to learn if replicas are up-to-date is looking at
  synced buried in followerInfo:
 
  $ java -jar cmdline-jmxclient-0.10.5.jar - localhost:8081
 
 
 org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader
  followerInfo
  04/14/2010 18:06:06 + org.archive.jmx.Client followerInfo:
  FollowerHandler Socket[addr=/10.0.0.10,port=48104,localport=2888]
  tickOfLastAck:29793 synced?:true queuedPacketLength:0
  FollowerHandler Socket[addr=/10.0.0.11,port=59599,localport=2888]
  tickOfLastAck:29793 synced?:true queuedPacketLength:0
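
For what it's worth, a rough sketch of scraping that output for the synced
flag (same jar and bean name as in the example above; the substring matching
is fragile, so this is just an illustration):

import re
import subprocess

# Same jar and bean name as in the example above; adjust for your deployment.
CMD = ['java', '-jar', 'cmdline-jmxclient-0.10.5.jar', '-', 'localhost:8081',
       'org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader',
       'followerInfo']

def unsynced_followers():
    """Count followers whose followerInfo line reports synced?:false."""
    proc = subprocess.Popen(CMD, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = proc.communicate()[0]
    return len(re.findall(r'synced\?:false', output))

print 'unsynced followers:', unsynced_followers()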
 
 
   I don't mind writing a tool to parse the JMX output and publish to
   Ganglia if needed, but it seems like a problem that may have already been
   solved and I'm curious what others are doing. The tool would basically take
   the zookeeper stats, normalize the names, and publish to a timeseries
   database.
 
  Is anyone already monitoring ZK in a way others might find useful?
 
  Thanks!
  Travis