Rafał Sumisławski created KAFKA-18615:
-----------------------------------------
Summary: StreamThread *-ratio metrics suffer from sampling bias
Key: KAFKA-18615
URL: https://issues.apache.org/jira/browse/KAFKA-18615
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 3.8.1
Reporter: Rafał Sumisławski
h2. Background
{{StreamThread}} defines the {{commit-ratio}}, {{poll-ratio}},
{{process-ratio}} and {{punctuate-ratio}} metrics here:
[https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/metrics/ThreadMetrics.java#L230-L288]
These metrics indicate "The fraction of time the thread spent on \{action}".
Unlike {{DefaultStateUpdater}}'s ratio metrics, these metrics are "value"
sensors, meaning that the observable value of a metric is simply the last value
recorded before the metric is read. This seems to avoid the ratio averaging
issues I described in KAFKA-18369, but...
h2. Issue
Let's analyse an example scenario.
For simplicity I will ignore the existence of {{poll-ratio}} and
{{punctuate-ratio}}, and consider only {{commit-ratio}} and
{{process-ratio}}.
Let's say an external observer, be it a human reading JMX metrics or an
automated metric scraping solution, reads the metrics at a random point in
time, uncorrelated with the inner workings of the Kafka Streams application.
The application itself works under a steady workload and the stream thread
completes 1000 iterations every 10 seconds, of which:
* 999 iterations execute only {{process}} taking {{1ms}}, resulting in
{{commit-ratio=0}}
* 1 iteration executes {{process}} taking {{1ms}} and {{commit}} taking
{{9000ms}}, resulting in {{commit-ratio=0.9998889012}}
The iterations occur in no specific order, but we will assume that two
committing iterations never happen one after another (the issue still exists
without this assumption, the math just gets harder; the assumption is also
realistic given how Kafka Streams works).
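The numbers in this scenario can be checked with a short sketch. It is only an illustration in Python, not code from Kafka; the constant and function names are hypothetical, and it assumes, as in the simplified scenario, that an iteration consists of {{process}} and {{commit}} time only.

```python
from fractions import Fraction

# Scenario parameters taken from the description above (all in milliseconds).
PROCESS_MS = 1
COMMIT_MS = 9000
SHORT_ITERATIONS = 999   # iterations that only process
LONG_ITERATIONS = 1      # iterations that process and commit


def commit_ratio(process_ms, commit_ms):
    """Fraction of a single iteration spent committing (hypothetical helper)."""
    return Fraction(commit_ms, process_ms + commit_ms)


short_ratio = commit_ratio(PROCESS_MS, 0)
long_ratio = commit_ratio(PROCESS_MS, COMMIT_MS)

# Total wall-clock time of one 1000-iteration period.
total_ms = (SHORT_ITERATIONS * PROCESS_MS
            + LONG_ITERATIONS * (PROCESS_MS + COMMIT_MS))

# Fraction of the whole period that the thread actually spends committing.
true_commit_fraction = Fraction(LONG_ITERATIONS * COMMIT_MS, total_ms)

print("commit-ratio of a short iteration:", float(short_ratio))
print("commit-ratio of the committing iteration:", round(float(long_ratio), 10))
print("period length in ms:", total_ms)
print("true fraction of time spent on commit:", float(true_commit_fraction))
```

The period adds up to exactly 10 seconds, and the time-weighted commit fraction over that period is 0.9.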
The ratio metrics are always updated at the end of an iteration. Therefore the
metric values corresponding to iteration number I are visible for the duration
of iteration number I+1. In that 10s period, there's only 1ms during which
{{commit-ratio=0.9998889012}} can be observed, as 1ms later one of the short
iterations completes and overwrites the metric values. During the remaining
9999ms, including the entire 9001ms committing iteration (which still shows the
value recorded by the short iteration before it), {{commit-ratio=0}} would be
observed. Therefore our random observer has a 99.99% probability of observing
{{commit-ratio=0}}, even though the {{StreamThread}} spends 90% of its time on
{{commit}}.
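The observer's odds can be reproduced with a small model. This is a sketch under the assumptions stated in this ticket, not Kafka code: each iteration's {{commit-ratio}} becomes visible when that iteration ends and stays visible for the whole next iteration, and the 1000-iteration pattern repeats cyclically. The function name and the position of the committing iteration are arbitrary, hypothetical choices.

```python
from fractions import Fraction


def visible_time_by_value(iterations):
    """Map each commit-ratio value to the number of ms it is observable.

    `iterations` is a list of (process_ms, commit_ms) tuples that repeats
    cyclically. The value computed for iteration i is visible for the
    duration of iteration i + 1.
    """
    visible = {}
    n = len(iterations)
    for i, (process_ms, commit_ms) in enumerate(iterations):
        ratio = Fraction(commit_ms, process_ms + commit_ms)
        next_process_ms, next_commit_ms = iterations[(i + 1) % n]
        visible[ratio] = visible.get(ratio, 0) + next_process_ms + next_commit_ms
    return visible


# 999 short iterations plus one committing iteration at an arbitrary position.
iterations = [(1, 0)] * 999
iterations.insert(500, (1, 9000))

period_ms = sum(p + c for p, c in iterations)
visible = visible_time_by_value(iterations)

# Probability that a uniformly random observer reads commit-ratio=0.
p_zero = Fraction(visible[Fraction(0)], period_ms)

# Expected observed value versus the true time-weighted commit fraction.
expected_observed = sum(ratio * ms for ratio, ms in visible.items()) / period_ms
true_fraction = Fraction(sum(c for _, c in iterations), period_ms)

print("P(observe commit-ratio=0):", float(p_zero))
print("expected observed commit-ratio:", float(expected_observed))
print("true fraction of time on commit:", float(true_fraction))
```

In this model the zero value is visible for 9999ms out of 10000ms, so the expected observed {{commit-ratio}} is close to 0.0001 while the true fraction is 0.9.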
h2. Solution
This ticket is a sibling of KAFKA-18369. I wanted to report it as a separate
ticket, as these are different metrics with a currently different
implementation, affected by a different problem that needs a separate
explanation. But in my opinion the ratio metrics of {{StreamThread}} and
{{DefaultStateUpdater}} should work, and be implemented, the same way, so when
it comes to a solution I will just refer to the ongoing discussion in
KAFKA-18369.