Hi, I just wrote ZooKeeper monitoring for the SaaS monitoring company I work for, at the request of one of our customers (a brief, sales-y view of the ZooKeeper monitoring is at http://www.logicmonitor.com/monitoring/applications/java-monitoring/zookeeper-monitoring/). Feel free to contact me directly if anyone is interested in anything specific.
I have a few suggestions as to how the exposed JMX objects could be improved:

- Instead of reporting average and max latency, which, so far as I can tell from the source code, are measured since server start (or since the MBean operation to reset the stats was last triggered), do the same as Tomcat and other projects: report the total processing time as one counter, and also report the total number of requests processed. Then if you want to calculate the average latency since server start, it's easy, but more interestingly, it's also easy to calculate the average latency for any time period (such as the last minute: sample total requests and total processing time at the start and end of the minute, subtract, divide, and there you go). This lets you graph and alert on latencies in a meaningful way.

- Having the MBean name change depending on whether the server is Leader or Follower is odd. It's the first time I've seen that in any JMX app (we do a lot more than we list on our website). It took a bit of thought to get consistent graphs regardless of the role the server is in. It probably presents a block to many other monitoring systems, so it may be worth changing at some point.

- Exposing things like "synced" as an operation, rather than an attribute, also seems odd. It would be nice if that were a simple attribute.

And finally, any chance someone can explain the "pendingRevalidationCount"? I couldn't figure that one out well enough to understand its significance.

Thanks
Steve
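
P.S. In case it helps, here is a rough sketch of the delta calculation from the first suggestion. The class name IntervalLatency and the two counters (totalRequests, totalProcessingMs) are made up for illustration; they are not existing ZooKeeper attributes, just what a monitoring client could do if two monotonically increasing counters like these were exposed.

```java
public final class IntervalLatency {

    /** One reading of the two hypothetical, monotonically increasing counters. */
    public static final class Sample {
        final long totalRequests;      // requests processed since server start
        final long totalProcessingMs;  // cumulative processing time, in ms

        public Sample(long totalRequests, long totalProcessingMs) {
            this.totalRequests = totalRequests;
            this.totalProcessingMs = totalProcessingMs;
        }
    }

    /**
     * Average latency (ms per request) between two samples.
     * Returns NaN when no requests were processed in the interval, or when
     * a counter went backwards (for example after a restart or stats reset).
     */
    public static double averageMs(Sample start, Sample end) {
        long requests = end.totalRequests - start.totalRequests;
        long timeMs = end.totalProcessingMs - start.totalProcessingMs;
        if (requests <= 0 || timeMs < 0) {
            return Double.NaN;
        }
        return (double) timeMs / requests;
    }

    public static void main(String[] args) {
        // Two readings taken one minute apart (made-up numbers).
        Sample start = new Sample(10_000, 42_000);
        Sample end = new Sample(10_600, 45_000);
        // 3000 ms spent on 600 requests during the interval.
        System.out.println(averageMs(start, end)); // prints 5.0
    }
}
```

The same two samples also give the average since server start (end.totalProcessingMs / end.totalRequests), so nothing is lost compared with the current average attribute.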
