[ 
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko reassigned HADOOP-14989:
--------------------------------------------

    Assignee:     (was: Erik Krogen)

> metrics2 JMX cache refresh result in inconsistent Mutable(Stat|Rate) values
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-14989
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14989
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: metrics
>    Affects Versions: 2.6.5
>            Reporter: Erik Krogen
>            Priority: Critical
>         Attachments: HADOOP-14989.test.patch
>
>
> While digging into the metrics2 system recently, we noticed that the way 
> {{MutableStat}} values are collected (and thus {{MutableRate}} values, since 
> it is based on {{MutableStat}}) means that every time the value is 
> snapshotted, all previous information is lost. So every time a JMX cache 
> refresh occurs, it resets the {{MutableStat}}, and the values emitted by the 
> configured metrics sinks no longer reflect the statistics gathered before 
> the refresh. The same happens if you configure multiple sink periods.
> To compute its average value, {{MutableStat}} maintains a running total and 
> an operation count, both measured since the last snapshot. Upon 
> snapshotting, the average is calculated as (total / opCount) and placed into 
> a gauge metric, and both the total and the operation count are cleared. So 
> the average value represents the average since the last snapshot. If only a 
> single sink period ever triggers snapshots, this gives the expected 
> behavior: the value is the average over the reporting period. However, each 
> additional configured sink period, and each JMX cache refresh, triggers a 
> snapshot of its own. So, for example, if you have a FileSink configured at a 
> 60 second interval and your JMX cache refreshes itself 1 second before the 
> FileSink period fires, the values emitted to your FileSink only represent 
> averages _over the last one second_.
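
The snapshot-and-reset behavior described above can be illustrated with a minimal sketch. `SimpleStat` is a hypothetical, heavily simplified stand-in written for this example; it is not the Hadoop `MutableStat` class and only models the "average, then clear" step, under the assumption of one operation per second.

```java
import java.util.Locale;

public class SnapshotResetSketch {

    /** Hypothetical, simplified stand-in for MutableStat; not the Hadoop class. */
    static final class SimpleStat {
        private double total;   // sum of samples since the last snapshot
        private long opCount;   // number of samples since the last snapshot
        private double lastAvg; // average published by the most recent snapshot

        void add(double value) {
            total += value;
            opCount++;
        }

        /** Publishes total / opCount as the average, then clears both accumulators. */
        double snapshot() {
            if (opCount > 0) {
                lastAvg = total / opCount;
            }
            total = 0;
            opCount = 0;
            return lastAvg;
        }
    }

    public static void main(String[] args) {
        SimpleStat stat = new SimpleStat();

        // First 59 seconds of a 60 second sink period: one 100 ms operation each.
        for (int i = 0; i < 59; i++) {
            stat.add(100.0);
        }
        // A JMX cache refresh snapshots the stat one second before the sink fires.
        double jmxAvg = stat.snapshot();

        // Final second: a single slow 1000 ms operation.
        stat.add(1000.0);
        // The sink snapshot only sees samples recorded after the JMX refresh.
        double sinkAvg = stat.snapshot();

        double trueAvg = (59 * 100.0 + 1000.0) / 60;
        System.out.printf(Locale.ROOT, "jmx=%.1f sink=%.1f true=%.1f%n",
                jmxAvg, sinkAvg, trueAvg);
        // prints: jmx=100.0 sink=1000.0 true=115.0
    }
}
```

In this toy run the sink reports 1000.0 for the period, although the average over the full 60 seconds is 115.0.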
> A few ways to solve this issue:
> * Make {{MutableRate}} manage its own average refresh, similar to 
> {{MutableQuantiles}}, which has a refresh thread and saves a snapshot of the 
> last quantile values that it will serve up until the next refresh. Given how 
> many {{MutableRate}} metrics there are, a thread per metric is not really 
> feasible, but could be done on e.g. a per-source basis. This has some 
> downsides: if multiple sinks are configured with different periods, what is 
> the right refresh period for the {{MutableRate}}? 
> * Make {{MutableRate}} emit two counters, one for total and one for operation 
> count, rather than an average gauge and an operation count counter. The 
> average could then be calculated downstream from this information. This is 
> cumbersome for operators and not backwards compatible. To improve on both of 
> those downsides, we could have it keep the current behavior but 
> _additionally_ emit the total as a counter. The snapshotted average is 
> probably sufficient in the common case (we've been using it for years), and 
> when more guaranteed accuracy is required, the average could be derived from 
> the total and operation count.
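
The second option can be sketched as follows: if the total and the operation count are exposed as counters that are never reset, each consumer can derive its own average from the deltas between its successive reads, independent of who else is reading. `CumulativeStat` and `Reader` are hypothetical names made up for this illustration; they are not part of metrics2.

```java
import java.util.Locale;

public class CumulativeTotalSketch {

    /** Hypothetical stat exposing cumulative counters that are never reset. */
    static final class CumulativeStat {
        private double total; // cumulative sum of all samples
        private long opCount; // cumulative number of samples

        void add(double value) {
            total += value;
            opCount++;
        }

        double total() { return total; }

        long opCount() { return opCount; }
    }

    /** Hypothetical downstream consumer that remembers the counters it last saw. */
    static final class Reader {
        private double prevTotal;
        private long prevCount;

        /** Average over the samples added since this reader's previous call. */
        double averageSinceLastRead(CumulativeStat stat) {
            double deltaTotal = stat.total() - prevTotal;
            long deltaCount = stat.opCount() - prevCount;
            prevTotal = stat.total();
            prevCount = stat.opCount();
            return deltaCount == 0 ? 0.0 : deltaTotal / deltaCount;
        }
    }

    public static void main(String[] args) {
        CumulativeStat stat = new CumulativeStat();
        Reader jmx = new Reader();
        Reader fileSink = new Reader();

        // Same scenario as in the description: 59 fast operations, a JMX read,
        // then one slow operation just before the 60 second sink period fires.
        for (int i = 0; i < 59; i++) {
            stat.add(100.0);
        }
        double jmxAvg = jmx.averageSinceLastRead(stat);
        stat.add(1000.0);
        double sinkAvg = fileSink.averageSinceLastRead(stat);

        System.out.printf(Locale.ROOT, "jmx=%.1f sink=%.1f%n", jmxAvg, sinkAvg);
        // prints: jmx=100.0 sink=115.0
    }
}
```

Because reading does not mutate the stat, the JMX read no longer affects what the sink computes for its own period.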
> The two suggestions above would fix this for both JMX and multiple sink 
> periods, but may be overkill. Multiple sink periods are probably not 
> necessary, though we should at least document the behavior.
> Open to suggestions & input here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

