[jira] [Commented] (ZOOKEEPER-2424) Detect and log possible GC churn in servers.

Chris Nauroth (JIRA) Mon, 09 May 2016 10:49:42 -0700

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276707#comment-15276707 ]


Chris Nauroth commented on ZOOKEEPER-2424:
------------------------------------------

This issue was inspired by an outage report from Elastic Cloud.

https://www.elastic.co/blog/elastic-cloud-outage-april-2016

I suggest that we implement something like Hadoop's JvmPauseMonitor, which has 
been very effective for diagnosis of NameNode JVM misconfigurations that cause 
long GC pauses.

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java

Detecting long GC pauses and logging them into the main ZooKeeper log would 
make the root cause more visible.
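
For illustration only, the general technique behind this kind of monitor can 
be sketched as follows: a daemon thread sleeps for a fixed interval, measures 
how much longer than requested the sleep actually took, and logs a warning 
when the overshoot exceeds a threshold. The class name, method names, 
interval, and threshold are all hypothetical and are not taken from ZooKeeper 
or from Hadoop's implementation. A real patch would use the server's logger 
rather than System.err.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a pause detector; not ZooKeeper or Hadoop code. */
final class PauseMonitorSketch implements Runnable {
    private final long sleepMs;          // how long each probe sleeps
    private final long warnThresholdMs;  // overshoot above this is logged
    private final AtomicLong longPauses = new AtomicLong();

    PauseMonitorSketch(long sleepMs, long warnThresholdMs) {
        this.sleepMs = sleepMs;
        this.warnThresholdMs = warnThresholdMs;
    }

    /** Time by which a sleep overshot its requested duration, never negative. */
    static long extraPauseMs(long elapsedMs, long sleepMs) {
        return Math.max(0L, elapsedMs - sleepMs);
    }

    /** Counts the pause and returns true if it exceeds the warn threshold. */
    boolean recordIfLong(long extraMs) {
        if (extraMs > warnThresholdMs) {
            longPauses.incrementAndGet();
            return true;
        }
        return false;
    }

    /** Number of long pauses seen so far; a candidate metric to expose. */
    long longPauseCount() {
        return longPauses.get();
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long elapsedMs =
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            long extraMs = extraPauseMs(elapsedMs, sleepMs);
            if (recordIfLong(extraMs)) {
                // The overshoot may come from GC or from the host itself.
                System.err.println("WARN: detected pause in JVM or host of "
                    + "approximately " + extraMs + "ms");
            }
        }
    }

    /** Starts the monitor on a daemon thread; interrupt the thread to stop. */
    static Thread start(PauseMonitorSketch monitor) {
        Thread t = new Thread(monitor, "PauseMonitorSketch");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

A server could call PauseMonitorSketch.start(new PauseMonitorSketch(500, 10000)) 
during startup. The measured overshoot cannot distinguish a GC pause from 
other stalls such as swapping or a paused VM, so the log message should say 
"JVM or host" rather than blame GC outright.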

Additionally, it would be nice to get relevant metrics about long GC pauses via 
HTTP calls to the Jetty admin server added in ZooKeeper 3.5.

> Detect and log possible GC churn in servers.
> --------------------------------------------
>
>                 Key: ZOOKEEPER-2424
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2424
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Chris Nauroth
>             Fix For: 3.5.3
>
>
> Excessive JVM garbage collection pauses can harm the stability of a ZooKeeper 
> ensemble.  If a stop-the-world GC pause in a server lasts long enough, then 
> the node will drop out of the ensemble.  If this happens on multiple 
> nodes simultaneously, then there is a risk of loss of quorum.  This issue 
> proposes to detect long GC pauses, log warnings about them, and expose 
> metrics about them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
