I filed ZOOKEEPER-2424 to track this.

--Chris Nauroth

On 5/9/16, 10:18 AM, "Patrick Hunt" <[email protected]> wrote:

>Makes sense to me to add it. Someone could create a ZK jira? Sounds like a
>great starter project for someone interested to get rolling with ZK. 3.5+
>adds jetty support for accessing metrics, sounds like it would dovetail
>nicely.
>
>Patrick
>
>On Mon, May 9, 2016 at 10:12 AM, Chris Nauroth <[email protected]>
>wrote:
>
>> I always sympathize with a major outage report, but on the bright side,
>> it was very satisfying to hear the ZooKeeper cluster had sustained uptime
>> for 3 years. That agrees with my own user experience. It's often the most
>> stable component of a distributed infrastructure (as it needs to be).
>>
>> As far as potential improvements, I was wondering if it would make sense
>> to introduce something like Hadoop's JvmPauseMonitor [1]. This is a
>> background thread that attempts to detect GC churn and log warnings
>> about it. This has been very helpful in diagnosing NameNode
>> misconfigurations that lead to GC churn.
>>
>> This wouldn't have prevented a problem for the Elastic Cloud team, but
>> at least it would have made the root cause more visible. A warning about
>> GC churn could have been shown in the main ZooKeeper log instead of a
>> separate GC log or inferring it from other sources like JMX.
>>
>> [1] https://s.apache.org/4sdx
>>
>> --Chris Nauroth
>>
>> On 5/8/16, 7:37 PM, "Patrick Hunt" <[email protected]> wrote:
>>
>> >Interesting root cause and mitigations discussion.
>> >
>> >https://www.elastic.co/blog/elastic-cloud-outage-april-2016
>> >
>> >Patrick
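
For anyone picking up the jira, here is a rough sketch of the pause-monitor idea described in the quoted message: a daemon thread sleeps for a fixed interval, measures how long the sleep really took, and logs a warning when the overshoot crosses a threshold. It is not Hadoop's JvmPauseMonitor code and not a proposed ZooKeeper patch. The class name PauseMonitorSketch, the 500 ms sleep, the 1000 ms threshold, and the use of System.err in place of a real logger are all assumptions made for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a JVM pause monitor; names and thresholds are illustrative.
public class PauseMonitorSketch implements Runnable {
    private final long sleepMillis;          // how long each probe sleeps
    private final long warnThresholdMillis;  // overshoot that triggers a warning
    private volatile boolean running = true;

    public PauseMonitorSketch(long sleepMillis, long warnThresholdMillis) {
        this.sleepMillis = sleepMillis;
        this.warnThresholdMillis = warnThresholdMillis;
    }

    // Time by which the observed sleep exceeded the requested sleep, never negative.
    public static long extraPauseMillis(long elapsedNanos, long sleepMillis) {
        return Math.max(0L, TimeUnit.NANOSECONDS.toMillis(elapsedNanos) - sleepMillis);
    }

    // Sum of accumulated collection time across all collectors the JVM reports.
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    @Override
    public void run() {
        while (running) {
            long gcBefore = totalGcTimeMillis();
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long extra = extraPauseMillis(System.nanoTime() - start, sleepMillis);
            if (extra > warnThresholdMillis) {
                long gcDelta = totalGcTimeMillis() - gcBefore;
                // A real server would route this to its main log instead of stderr.
                System.err.printf(
                    "WARN: detected pause of approximately %d ms (GC time in interval: %d ms)%n",
                    extra, gcDelta);
            }
        }
    }

    public void stop() {
        running = false;
    }

    public static void main(String[] args) throws InterruptedException {
        PauseMonitorSketch monitor = new PauseMonitorSketch(500, 1000);
        Thread t = new Thread(monitor, "pause-monitor-sketch");
        t.setDaemon(true);
        t.start();
        Thread.sleep(1200); // let it probe a couple of times
        monitor.stop();
        t.join();
    }
}
```

The overshoot alone cannot distinguish a GC pause from, say, the host being starved of CPU, which is why the sketch also prints the GC time accumulated during the interval. A small delta next to a large pause points away from the collector.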
