The Data Lorax created KAFKA-3456:
-------------------------------------

             Summary: In-house KafkaMetric misreports metrics when periodically observed
                 Key: KAFKA-3456
                 URL: https://issues.apache.org/jira/browse/KAFKA-3456
             Project: Kafka
          Issue Type: Bug
          Components: consumer, core, producer
    Affects Versions: 0.9.0.1, 0.9.0.0, 0.10.0.0
            Reporter: The Data Lorax
            Assignee: Neha Narkhede
            Priority: Minor

The metrics captured by Kafka through the in-house {{SampledStat}} are misreported if they are observed periodically.

Consider a {{Rate}} metric that uses the default 2 samples and 30 second sample window, i.e. the {{Rate}} captures 60 seconds' worth of data. To report this metric to some external system, we might poll it every 60 seconds to observe the current value. Using a shorter period would, in the case of a {{Rate}}, lead to smoothing of the plotted data and, worse, in the case of a {{Count}}, lead to double counting. So 60 seconds is the only period at which we can poll the metrics if we are to report accurate values.

To demonstrate the issue, consider the following somewhat extreme case:

The {{Rate}} is capturing data from a system that alternates between a rate of 999 per sec and a rate of 1 per sec every 30 seconds, with the different rates aligned with the sample boundaries within the {{Rate}} instance, i.e. after 60 seconds the first sample within the {{Rate}} instance will have a rate of 999 per sec and the second a rate of 1 per sec.

If we were to ask the metric for its value at this 60 second boundary, it would correctly report 500 per sec. However, if we asked it again 1 millisecond later, it would report 1 per sec, as the first sample window has been aged out. Depending on how far into the 60 sec period of the metric our periodic poll lands, we would observe a constant rate somewhere in the range of 1 to 500 per second, most likely around the 250 mark.

Other metrics based on the {{SampledStat}} type suffer from the same issue, e.g. the {{Count}} metric, given a constant rate of 1 per second, will report a constant count somewhere between 30 and 60, rather than the correct 60.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
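
The {{Count}} case described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not Kafka's {{SampledStat}} code: the names observed_count and poll_series are hypothetical, sample windows are assumed to be aligned to multiples of 30 seconds, and a poll landing exactly on a boundary is assumed to still see the window that just ended.

```python
WINDOW_S = 30      # sample window length in seconds (default per the report)
NUM_SAMPLES = 2    # number of retained samples (default per the report)


def observed_count(poll_time_s, events_per_s=1):
    """Count seen by a simplified two-window sampled counter (hypothetical model).

    Assumption: windows are aligned to multiples of WINDOW_S, and a poll
    exactly on a boundary still sees the window that just ended as current.
    """
    # Seconds elapsed inside the current, possibly partial, window.
    into_current = poll_time_s % WINDOW_S
    if into_current == 0 and poll_time_s > 0:
        into_current = WINDOW_S
    # Every older retained window is complete; anything older is aged out.
    full_windows = NUM_SAMPLES - 1
    covered_s = min(poll_time_s, full_windows * WINDOW_S + into_current)
    return covered_s * events_per_s


def poll_series(offset_s, polls=5, period_s=60):
    """Counts seen by a poller that fires every period_s, offset_s late."""
    return [observed_count(offset_s + period_s * n) for n in range(1, polls + 1)]


if __name__ == "__main__":
    for offset in (0, 1, 15, 29):
        print(offset, poll_series(offset))
```

In this model a poller aligned exactly with the 60 second boundary sees the full 60 events on every poll, while a poller that fires 1 second late sees 31 on every poll, even though the underlying event rate never changes. The observed value is constant for a given offset, which is why the under-count is easy to miss on a dashboard.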