[ https://issues.apache.org/jira/browse/HADOOP-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245274#comment-16245274 ]
Xiao Chen commented on HADOOP-14960:
------------------------------------

Thanks Misha for the new changes by incorporating the timestamp and gctime into a class, and using a {{GcData}} class to handle update atomicity, looks pretty good!

Please fix the checkstyle warnings. While you're at it, I have a few minor comments :)
- Can we also {{setName}} on the {{GcTimeMonitor}} class, for better debuggability?
- Let's add a precondition check on {{bufSize}} too, to make sure we don't allocate crazy sizes here (say, 1M?)
- trivial Javadoc comments:
{{put a limit on a number of GCTimeMonitor instances}} s/a number/the number/g
{{@param observationWindowMs a period until now, over which the percentage}} s/a period until now, over which/the interval over which/
- We usually use javadoc comment style on the ASF license class header. Could you update {{GcTimeMonitor}}'s first line from {{/\*}} to {{/\*\*}}?

> Add GC time percentage monitor/alerter
> --------------------------------------
>
>                 Key: HADOOP-14960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14960
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HADOOP-14960.01.patch, HADOOP-14960.02.patch,
> HADOOP-14960.03.patch
>
>
> Currently class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides
> several metrics related to GC. Unfortunately, all these metrics are not as
> useful as they could be, because they don't answer the first and most
> important question related to GC and JVM health: what percentage of time is
> my JVM paused in GC?
> This percentage, calculated as the sum of the GC pauses over some period
> (say, 1 minute) divided by that period, is the most convenient measure of
> GC health because:
> - it is just one number, and it's clear that, say, 1..5% is good, but 80..90%
> is really bad
> - it allows for easy apples-to-apples comparison between runs, even between
> different apps
> - when this metric reaches some critical value like 70%, it almost always
> indicates a "GC death spiral", from which the app can recover only if it
> drops some task(s) etc.
> The existing "total GC time", "total number of GCs" etc. metrics only give
> numbers that can be used to roughly estimate this percentage. Thus it is
> suggested to add a new metric to this class, and possibly allow users to
> register handlers that will be automatically invoked if this metric reaches
> the specified threshold.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
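
For readers following the issue description, the metric it proposes (GC pause time summed over an observation window, divided by the window length) can be illustrated with a small sketch. This is not the {{GcTimeMonitor}} from the attached patches: the class name {{GcPercentSketch}}, its methods, and the 1M cap on {{bufSize}} (taken from the "say, 1M?" suggestion in the comment) are all assumptions made for illustration. Only the {{java.lang.management}} calls are real JDK APIs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical sketch of a sliding-window GC time percentage; not the patch's code. */
final class GcPercentSketch {
  // Assumed upper bound on the sample buffer, per the "say, 1M?" review suggestion.
  static final int MAX_BUF_SIZE = 1_000_000;

  private final long windowMs;
  private final long[] timestamps; // ring buffer: sample times in ms
  private final long[] gcTotals;   // ring buffer: cumulative GC time in ms at each sample
  private int next;                // slot the next sample will be written to
  private int count;               // number of valid samples currently stored

  GcPercentSketch(long windowMs, int bufSize) {
    if (windowMs <= 0) {
      throw new IllegalArgumentException("windowMs must be positive");
    }
    if (bufSize < 2 || bufSize > MAX_BUF_SIZE) {
      throw new IllegalArgumentException("bufSize out of range: " + bufSize);
    }
    this.windowMs = windowMs;
    this.timestamps = new long[bufSize];
    this.gcTotals = new long[bufSize];
  }

  /** Records one sample, overwriting the oldest one once the buffer is full. */
  synchronized void addSample(long nowMs, long totalGcMs) {
    timestamps[next] = nowMs;
    gcTotals[next] = totalGcMs;
    next = (next + 1) % timestamps.length;
    if (count < timestamps.length) {
      count++;
    }
  }

  /** GC time as an integer percentage of the elapsed time covered by the window. */
  synchronized int gcTimePercentage() {
    if (count < 2) {
      return 0; // need two samples to measure an interval
    }
    int len = timestamps.length;
    int newest = (next - 1 + len) % len;
    int i = (next - count + len) % len; // index of the oldest stored sample
    long cutoff = timestamps[newest] - windowMs;
    // Skip samples that fall outside the observation window.
    while (i != newest && timestamps[i] < cutoff) {
      i = (i + 1) % len;
    }
    long elapsed = timestamps[newest] - timestamps[i];
    if (elapsed <= 0) {
      return 0;
    }
    return (int) ((gcTotals[newest] - gcTotals[i]) * 100 / elapsed);
  }

  /** Sums cumulative collection time over all collectors reported by the JVM. */
  static long currentTotalGcMs() {
    long total = 0;
    for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = bean.getCollectionTime(); // -1 when the collector does not report it
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }
}
```

A caller would typically invoke {{addSample(System.currentTimeMillis(), GcPercentSketch.currentTotalGcMs())}} from a periodic thread and compare {{gcTimePercentage()}} against a threshold. Taking explicit arguments in {{addSample}} keeps the arithmetic testable without waiting for real GC activity.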