[ https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko reassigned HADOOP-14989:
--------------------------------------------

    Assignee:     (was: Erik Krogen)

> metrics2 JMX cache refresh result in inconsistent Mutable(Stat|Rate) values
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-14989
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14989
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: metrics
>    Affects Versions: 2.6.5
>            Reporter: Erik Krogen
>            Priority: Critical
>         Attachments: HADOOP-14989.test.patch
>
>
> While doing some digging in the metrics2 system recently, we noticed that the
> way {{MutableStat}} values are collected (and thus {{MutableRate}}, since it
> is based on {{MutableStat}}) means that every time the value is
> snapshotted, all previous information is lost. So every time a JMX cache
> refresh occurs, it resets the {{MutableStat}}, meaning that the configured
> metrics sinks do not consider the previous statistics in their emitted
> values. The same is true if you configure multiple sink periods.
>
> To compute its average value, {{MutableStat}} maintains a total value since
> the last snapshot, as well as an operation count since the last snapshot. Upon
> snapshotting, the average is calculated as (total / opCount) and placed into
> a gauge metric, and the total and operation count are cleared. So the average
> value represents the average since the last snapshot. If only a single sink
> period ever takes snapshots, this results in the expected behavior: the
> value is the average over the reporting period. However, if multiple sink
> periods are configured, or if the JMX cache is refreshed, each of those is
> another snapshot operation. So, for example, if you have a FileSink configured
> at a 60 second interval and your JMX cache refreshes itself 1 second before
> the FileSink period fires, the values emitted to your FileSink only represent
> averages _over the last one second_.
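
To make the reset-on-snapshot effect concrete, here is a minimal sketch in plain Java. It is not the Hadoop implementation: {{SimpleStat}}, {{snapshotAverage}} and the sample values (59 operations of 10 ms, then one of 100 ms) are hypothetical names and numbers chosen only to mirror the FileSink/JMX example above.

```java
/** Simplified, hypothetical model of reset-on-snapshot averaging; not Hadoop code. */
class SnapshotResetDemo {

    /** Tracks a running total and operation count since the last snapshot. */
    static final class SimpleStat {
        private double total;  // sum of samples since the last snapshot
        private long opCount;  // number of samples since the last snapshot

        synchronized void add(double value) {
            total += value;
            opCount++;
        }

        /** Returns the average since the last snapshot, then clears the state. */
        synchronized double snapshotAverage() {
            double avg = opCount == 0 ? 0.0 : total / opCount;
            total = 0.0;
            opCount = 0;
            return avg;
        }
    }

    /** Returns {average seen by the JMX refresh, average seen by the sink}. */
    static double[] simulate() {
        SimpleStat stat = new SimpleStat();
        for (int i = 0; i < 59; i++) {
            stat.add(10.0);                       // seconds 1..59: one 10 ms op each
        }
        double jmxAvg = stat.snapshotAverage();   // JMX cache refresh at 59 s
        stat.add(100.0);                          // second 60: one slow 100 ms op
        double sinkAvg = stat.snapshotAverage();  // sink period fires at 60 s
        return new double[] { jmxAvg, sinkAvg };
    }

    public static void main(String[] args) {
        double[] r = simulate();
        System.out.println("JMX refresh saw average: " + r[0]);
        System.out.println("Sink saw average:        " + r[1]);
        // The average over all 60 samples would be (59 * 10 + 100) / 60 = 11.5.
    }
}
```

In this model the sink reports 100.0 for its 60 second period, although the average over all 60 samples is 11.5, because the earlier snapshot discarded the first 59 samples.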
>
> A few ways to solve this issue:
> * Make {{MutableRate}} manage its own average refresh, similar to
> {{MutableQuantiles}}, which has a refresh thread and saves a snapshot of the
> last quantile values that it will serve up until the next refresh. Given how
> many {{MutableRate}} metrics there are, a thread per metric is not really
> feasible, but this could be done on, e.g., a per-source basis. This has some
> downsides: if multiple sinks are configured with different periods, what is
> the right refresh period for the {{MutableRate}}?
> * Make {{MutableRate}} emit two counters, one for total and one for operation
> count, rather than an average gauge and an operation count counter. The
> average could then be calculated downstream from this information. This is
> cumbersome for operators and not backwards compatible. To improve on both of
> those downsides, we could have it keep the current behavior but
> _additionally_ emit the total as a counter. The snapshotted average is
> probably sufficient in the common case (we've been using it for years), and
> when more guaranteed accuracy is required, the average could be derived from
> the total and operation count.
>
> The two suggestions above will fix this for both JMX and multiple sink
> periods, but may be overkill. Multiple sink periods are probably not
> necessary, though we should at least document the behavior.
>
> Open to suggestions & input here.
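
The second option (keep the average gauge, but additionally emit the total as a counter) can be sketched as follows. This is an illustration under assumptions, not a patch: {{StatWithTotal}}, {{Snapshot}} and {{averageBetween}} are hypothetical names, and the sample values are made up. The point is only that a consumer which remembers its own previous total and operation count can derive a correct average no matter how many other snapshots happened in between.

```java
/** Hypothetical sketch of "keep the average gauge, additionally emit the total"; not Hadoop code. */
class CumulativeTotalDemo {

    /** Immutable view of what one snapshot would emit. */
    static final class Snapshot {
        final double intervalAvg;  // gauge: average since the previous snapshot (current behavior)
        final double total;        // counter: sum of all samples, never reset (proposed addition)
        final long numOps;         // counter: number of samples, never reset

        Snapshot(double intervalAvg, double total, long numOps) {
            this.intervalAvg = intervalAvg;
            this.total = total;
            this.numOps = numOps;
        }
    }

    static final class StatWithTotal {
        private double intervalTotal;  // cleared on every snapshot
        private long intervalOps;      // cleared on every snapshot
        private double total;          // cumulative
        private long numOps;           // cumulative

        synchronized void add(double value) {
            intervalTotal += value;
            intervalOps++;
            total += value;
            numOps++;
        }

        synchronized Snapshot snapshot() {
            double avg = intervalOps == 0 ? 0.0 : intervalTotal / intervalOps;
            intervalTotal = 0.0;
            intervalOps = 0;
            return new Snapshot(avg, total, numOps);
        }
    }

    /** Downstream derivation: average over the window between two snapshots of one consumer. */
    static double averageBetween(Snapshot earlier, Snapshot later) {
        long ops = later.numOps - earlier.numOps;
        return ops == 0 ? 0.0 : (later.total - earlier.total) / ops;
    }

    /** Returns {gauge average seen by the sink, average derived from the counters}. */
    static double[] simulate() {
        StatWithTotal stat = new StatWithTotal();
        Snapshot sinkStart = stat.snapshot();  // sink period begins
        for (int i = 0; i < 59; i++) {
            stat.add(10.0);
        }
        stat.snapshot();                       // unrelated JMX refresh clears the interval state
        stat.add(100.0);
        Snapshot sinkEnd = stat.snapshot();    // sink period ends
        return new double[] { sinkEnd.intervalAvg, averageBetween(sinkStart, sinkEnd) };
    }

    public static void main(String[] args) {
        double[] r = simulate();
        System.out.println("Gauge average at sink:   " + r[0]);
        System.out.println("Derived from counters:   " + r[1]);
    }
}
```

Here the gauge still shows the skewed value (100.0), which preserves the existing behavior for current consumers, while the counter deltas give 11.5 for the sink's full period.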