[jira] [Created] (KAFKA-18615) StreamThread *-ratio metrics suffer from sampling bias

Jira Tue, 21 Jan 2025 09:51:19 -0800

Rafał Sumisławski created KAFKA-18615:
-----------------------------------------


             Summary: StreamThread *-ratio metrics suffer from sampling bias
                 Key: KAFKA-18615
                 URL: https://issues.apache.org/jira/browse/KAFKA-18615
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 3.8.1
            Reporter: Rafał Sumisławski


h2. Background

{{StreamThread}} defines the {{commit-ratio}}, {{poll-ratio}}, 
{{process-ratio}} and {{punctuate-ratio}} metrics here: 
[https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/metrics/ThreadMetrics.java#L230-L288]
These metrics indicate "The fraction of time the thread spent on \{action}". 
Unlike DefaultStateUpdater's ratio metrics, these metrics are "value" sensors, 
meaning that the observable value of the metric is simply the last value 
registered before the metric is read. This seems to avoid the ratio averaging 
issues I described in KAFKA-18369, but it introduces a sampling bias.

h2. Issue

Let's analyse an example scenario.

For simplicity I will ignore the existence of {{poll-ratio}} and 
{{punctuate-ratio}}, and consider only {{commit-ratio}} and 
{{process-ratio}}.

Let's say an external observer, be it a human reading JMX metrics, or an 
automated metric scraping solution, reads the metrics at a random point in 
time, uncorrelated with the inner workings of the kafka-streams application.

The application itself works under a steady workload and the stream thread does 
1000 iterations every 10 seconds, of which:
 * 999 iterations execute only {{process}} taking {{1ms}}, resulting in 
{{commit-ratio=0}}
 * 1 iteration executes {{process}} taking {{1ms}} and {{commit}} taking 
{{9000ms}}, resulting in {{commit-ratio=0.9998889012}}

These iterations occur in no specific order, but we will assume that two 
committing iterations never happen one after another (the issue still exists 
without this assumption, the math just gets harder, and the assumption is 
realistic given how Kafka Streams works).
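
The figures in this scenario can be checked with a few lines of Python. This is 
only a sketch of the arithmetic: the names are hypothetical and not taken from 
the Kafka code base, and it assumes that the ratio recorded for an iteration is 
the commit time divided by the total time of that iteration.

```python
# Hypothetical names; this models only the numbers from the scenario above.
PROCESS_MS = 1
COMMIT_MS = 9000
SHORT_ITERATIONS = 999


def commit_ratio(process_ms, commit_ms):
    """Fraction of one iteration spent in commit (assumed definition)."""
    return commit_ms / (process_ms + commit_ms)


short_ratio = commit_ratio(PROCESS_MS, 0)         # iteration without a commit
long_ratio = commit_ratio(PROCESS_MS, COMMIT_MS)  # the single committing iteration
total_ms = SHORT_ITERATIONS * PROCESS_MS + (PROCESS_MS + COMMIT_MS)

print(short_ratio)
print(round(long_ratio, 10))
print(total_ms)
```

The 999 short iterations and the single committing iteration together fill the 
10 second window.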

The ratio metrics are always updated at the end of an iteration. Therefore the 
metric values corresponding to iteration I are visible for the duration of 
iteration I+1. In that 10s period, there is only 1ms during which 
{{commit-ratio=0.9998889012}} can be observed, because 1ms later one of the 
short iterations completes and overwrites the metric values. During the 
remaining 9999ms, {{commit-ratio=0}} is observed. Therefore our random observer 
has a 99.99% probability of observing {{commit-ratio=0}}, even though the 
{{StreamThread}} spends 90% of its time on {{commit}}.
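
The observation probabilities can be reproduced with a small model. The sketch 
below is not Kafka code; the function name and the list-of-tuples 
representation are assumptions made for illustration. It weights each recorded 
value by the duration of the following iteration, which is how long that value 
stays visible, and treats the steady workload as a repeating cycle.

```python
from fractions import Fraction


def observation_probabilities(iterations):
    """Map each recorded value to the fraction of time it is visible.

    `iterations` is a list of (recorded_ratio, duration_ms) tuples. The value
    recorded at the end of iteration i is visible during iteration i+1; the
    list wraps around because the workload is assumed to repeat.
    """
    total_ms = sum(duration for _, duration in iterations)
    visible = {}
    for i, (value, _) in enumerate(iterations):
        next_duration = iterations[(i + 1) % len(iterations)][1]
        visible[value] = visible.get(value, 0) + Fraction(next_duration, total_ms)
    return visible


# 999 short iterations (1ms, no commit) and one committing iteration (9001ms).
zero = Fraction(0)
high = Fraction(9000, 9001)
cycle = [(zero, 1)] * 999 + [(high, 9001)]

probs = observation_probabilities(cycle)
time_in_commit = Fraction(9000, sum(duration for _, duration in cycle))

print(float(probs[zero]))      # chance a random observer reads commit-ratio=0
print(float(probs[high]))      # chance of reading the high commit-ratio
print(float(time_in_commit))   # share of time actually spent in commit
```

Under these assumptions the observer reads 0 for 9999 of every 10000 
milliseconds, while the thread spends 9000 of them committing.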

h2. Solution

This ticket is a sibling of KAFKA-18369. I wanted to report it as a separate 
ticket because these are different metrics with a currently different 
implementation, affected by a different problem that needs a separate 
explanation. But in my opinion the ratio metrics of {{StreamThread}} and 
{{DefaultStateUpdater}} should work and be implemented the same way, so when it 
comes to a solution I will just refer to the ongoing discussion in KAFKA-18369.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
