[ https://issues.apache.org/jira/browse/KAFKA-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16567632#comment-16567632 ]
Sam Lendle commented on KAFKA-7240:
-----------------------------------

I believe at least part of the issue is in [StreamsMetricsImpl#addThroughputMetrics|https://github.com/apache/kafka/blob/ee5cc974d2ef449444861d82e1793668184ca86f/streams/src/main/java/org/apache/kafka/streams/processor/internals/metrics/StreamsMetricsImpl.java#L352], which uses Count(). Count() is a SampledStat, so the value it reports is the count in recent time windows, and the value decreases whenever a window is purged.

I think the fix there would be to use a non-SampledStat version of Count(), as Total() is to Rate.SampledTotal().

> -total metrics in Streams are incorrect
> ---------------------------------------
>
>                 Key: KAFKA-7240
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7240
>             Project: Kafka
>          Issue Type: Bug
>          Components: metrics, streams
>    Affects Versions: 2.0.0
>            Reporter: Sam Lendle
>            Priority: Major
>
> I noticed the values of total metrics for streams were decreasing
> periodically when viewed in JMX, for example process-total for each
> processor-node-id under stream-processor-node-metrics.
> Edit: For processor node metrics, I should have been looking at
> ProcessorNode, not StreamsMetricsThreadImpl.
> -Looking at StreamsMetricsThreadImpl, I believe this behavior is due to
> using Count() as the Stat for the *-total metrics. Count() is a SampledStat,
> so the value it reports is the count in recent time windows, and the value
> decreases whenever a window is purged.-
> ----
> -This explains the behavior I saw, but I think the issue is deeper. For
> example, processTimeSensor attempts to measure process-latency-avg,
> process-latency-max, process-rate, and process-total. For that sensor, record
> is called like-
> -streamsMetrics.processTimeSensor.record(computeLatency() / (double)
> processed, timerStartedMs);-
> -so the value passed to record is the average latency per processed message in
> this batch, if I understand correctly. That gets pushed through to the call to
> Count#record, which increments its count by 1, ignoring the value parameter.
> Whatever stat is recording the total would need to know the number of
> messages processed. Because of that, I don't think it's possible for one
> Sensor to measure both latency and total.-
> -That said, it's not clear to me how all the different Stats work and how
> exactly Sensors work, and I don't actually understand how the process-rate
> metric is working for similar reasons, but that seems to be correct, so I may
> be missing something here.-
>
> cc [~guozhang]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
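
To illustrate why a windowed count can decrease while a cumulative count cannot, here is a minimal, self-contained sketch. It is not Kafka's implementation: `CountDemo`, `WindowedCount`, and `CumulativeCount` are hypothetical names, and the purge rule (drop a window once its start is at least `windowMs * numWindows` old) is a simplifying assumption made only for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CountDemo {

    /** Hypothetical windowed counter: only events in retained windows are counted. */
    static final class WindowedCount {
        private final long windowMs;
        private final int numWindows;
        // Each entry is {windowStartMs, eventCount}.
        private final Deque<long[]> samples = new ArrayDeque<>();

        WindowedCount(long windowMs, int numWindows) {
            this.windowMs = windowMs;
            this.numWindows = numWindows;
        }

        void record(long nowMs) {
            long[] last = samples.peekLast();
            // Start a new window when there is none or the current one has elapsed.
            if (last == null || nowMs - last[0] >= windowMs) {
                last = new long[] {nowMs, 0};
                samples.addLast(last);
            }
            last[1]++;
        }

        long measure(long nowMs) {
            long retentionMs = windowMs * numWindows;
            // Purge expired windows (assumed rule); this is where the value can drop.
            while (!samples.isEmpty() && nowMs - samples.peekFirst()[0] >= retentionMs) {
                samples.removeFirst();
            }
            long total = 0;
            for (long[] s : samples) {
                total += s[1];
            }
            return total;
        }
    }

    /** Hypothetical cumulative counter: never forgets events, so it never decreases. */
    static final class CumulativeCount {
        private long total;

        void record() {
            total++;
        }

        long measure() {
            return total;
        }
    }

    public static void main(String[] args) {
        WindowedCount windowed = new WindowedCount(1000, 2);
        CumulativeCount cumulative = new CumulativeCount();
        long[] eventTimesMs = {0, 100, 200, 1000, 1500};
        for (long t : eventTimesMs) {
            windowed.record(t);
            cumulative.record();
        }
        System.out.println(windowed.measure(1500)); // 5: both windows still retained
        System.out.println(windowed.measure(2000)); // 2: the window starting at 0 was purged
        System.out.println(cumulative.measure());   // 5: unaffected by purging
    }
}
```

A metric named `*-total` is presumably expected to behave like `CumulativeCount` here; the periodic decreases described in the issue match the `WindowedCount` behavior.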