[
https://issues.apache.org/jira/browse/ZOOKEEPER-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15276707#comment-15276707
]
Chris Nauroth commented on ZOOKEEPER-2424:
------------------------------------------
This issue was inspired by an outage report from Elastic Cloud.
https://www.elastic.co/blog/elastic-cloud-outage-april-2016
I suggest that we implement something like Hadoop's JvmPauseMonitor, which has
been very effective for diagnosis of NameNode JVM misconfigurations that cause
long GC pauses.
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
Detecting long GC pauses and logging them into the main ZooKeeper log would
make the root cause more visible.
Additionally, it would be nice to get relevant metrics about long GC pauses via
HTTP calls to the Jetty admin server added in ZooKeeper 3.5.
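
To make the proposal concrete, here is a minimal sketch of the detection idea: a background thread sleeps for a fixed interval and treats any large overshoot in wall-clock time as a probable JVM or host pause. Hadoop's JvmPauseMonitor is built on the same general idea, but this is not its code and not a proposed ZooKeeper patch. The class name, method names, thresholds, and the use of System.err in place of the server's logger are all assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch only; names and thresholds are illustrative assumptions. */
class PauseMonitorSketch implements Runnable {
    private final long sleepMillis;          // how long each probe sleeps
    private final long warnThresholdMillis;  // overshoot that triggers a warning
    private final AtomicLong pausesOverThreshold = new AtomicLong();
    private volatile boolean running = true;

    PauseMonitorSketch(long sleepMillis, long warnThresholdMillis) {
        this.sleepMillis = sleepMillis;
        this.warnThresholdMillis = warnThresholdMillis;
    }

    /** Time spent beyond the requested sleep, in milliseconds (never negative). */
    static long extraSleepMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    /** Counts and logs the overshoot if it exceeds the threshold. */
    boolean record(long extraMillis) {
        if (extraMillis > warnThresholdMillis) {
            pausesOverThreshold.incrementAndGet();
            // A real server would use its logging framework here.
            System.err.println("WARN Detected pause in JVM or host machine "
                    + "(e.g. GC): approximately " + extraMillis + "ms");
            return true;
        }
        return false;
    }

    /** Counter that a metrics endpoint could expose. */
    long getPausesOverThreshold() {
        return pausesOverThreshold.get();
    }

    void stop() {
        running = false;
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            record(extraSleepMillis(start, System.nanoTime(), sleepMillis));
        }
    }
}
```

A server could run this on a daemon thread, for example new Thread(new PauseMonitorSketch(500, 1000), "pause-monitor") with setDaemon(true) before start(). The overshoot cannot tell a GC pause apart from other stalls such as host-level scheduling delays, so the log message should say "possible" rather than assert a cause.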
> Detect and log possible GC churn in servers.
> --------------------------------------------
>
> Key: ZOOKEEPER-2424
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2424
> Project: ZooKeeper
> Issue Type: Improvement
> Components: server
> Reporter: Chris Nauroth
> Fix For: 3.5.3
>
>
> Excessive JVM garbage collection pauses can harm the stability of a ZooKeeper
> ensemble. If a stop-the-world GC pause in a server lasts long enough, then
> the node will drop out of the ensemble. If this happens on multiple
> nodes simultaneously, then there is a risk of loss of quorum. This issue
> proposes to detect long GC pauses, log warnings about them, and expose
> metrics about them.