[ https://issues.apache.org/jira/browse/KAFKA-19678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029604#comment-18029604 ]

Matthias J. Sax commented on KAFKA-19678:
-----------------------------------------

This metric is a little bit tricky... (for context 
[KIP-989|https://cwiki.apache.org/confluence/display/KAFKA/KIP-989%3A+Improved+StateStore+Iterator+metrics+for+detecting+leaks])
 – if we reported `0` (or `-1`) when no iterators are open, the issue is that an 
alert which computes "currentTime minus metricValue" would produce false 
positives, because the computed iterator open time would be huge (many years). 
The alert would need to be conditional, which is a struggle as far as I know. 
While a dashboard can render `0`, it would also blow out the y-axis to a very 
large range, making the dashboard very hard to actually read.
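To make the false-positive concrete, here is a minimal illustration (not Kafka code; the sentinel value and the alert threshold are made up) of what a naive "currentTime minus metricValue" rule would compute if the metric reported `0` while no iterator is open:
{code:java}
// Minimal illustration (not Kafka code): if the "oldest open iterator" metric
// reported 0 as an "empty" sentinel, an alert computing "now minus metric value"
// would see an iterator that has apparently been open since the Unix epoch.
public class IteratorAgeAlertExample {
    public static void main(String[] args) {
        long metricValue = 0L;                         // hypothetical "no iterators" sentinel
        long nowMs = System.currentTimeMillis();
        long apparentOpenMs = nowMs - metricValue;     // decades, expressed in milliseconds
        boolean alertFires = apparentOpenMs > 60_000L; // e.g. "iterator open for more than 1 minute"
        System.out.println("apparent open time in days: " + (apparentOpenMs / 86_400_000L)
                + ", alert fires: " + alertFires);     // always fires -> false positive
    }
}
{code}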

We actually reported `null` originally, but this also caused issues: 
https://issues.apache.org/jira/browse/KAFKA-17954 – so we decided to 
de-register the metric when there are no open iterators.
{quote} otherwise we are sure no new iterators will ever be created.
{quote}
Not sure what you mean by this?

For your use case: how many values per group do you get? Would it be possible 
to do an `aggregation` per group and maintain a `List` of all values per 
group? This would allow you to maintain the list with a single key lookup per 
update, avoiding a range scan (of course, this only works if the list is small 
enough to avoid overly large records...)

> Streams open iterator tracking has high contention on metrics lock
> ------------------------------------------------------------------
>
>                 Key: KAFKA-19678
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19678
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 4.1.0
>            Reporter: Steven Schlansker
>            Priority: Major
>         Attachments: image-2025-09-05-12-13-24-910.png
>
>
> We run Kafka Streams 4.1.0 with custom processors that heavily use state 
> store range iterators.
> While investigating disappointing performance, we found a surprising source 
> of lock contention.
> Over the course of a roughly one-minute profiler sample, the 
> {{org.apache.kafka.common.metrics.Metrics}} lock is taken approximately 
> 40,000 times and blocks threads for a combined total of about 1 minute.
> This appears to be because our state stores generally have no iterators open, 
> except when their processor is processing a record, in which case it opens an 
> iterator (taking the lock through {{OpenIterators.add}} into 
> {{Metrics.registerMetric}}), does a tiny bit of work, and then closes the 
> iterator (again taking the lock through {{OpenIterators.remove}} into 
> {{Metrics.removeMetric}}).
> So, stream processing threads take a globally shared lock twice per record, 
> for this subset of our data. I've attached a profiler thread state 
> visualization with our findings - the red bars indicate where a thread was 
> blocked on this lock during the sample. As you can see, this lock seems to be 
> severely hampering our performance.
>  
> !image-2025-09-05-12-13-24-910.png!
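For reference, a minimal sketch (processor and store names are hypothetical) of the access pattern described in the report: a range iterator opened and closed for every processed record, registering and removing a metric each time:
{code:java}
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Hypothetical processor showing the per-record range-scan pattern from the report.
public class RangeScanProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> store;

    @Override
    public void init(final ProcessorContext<String, String> context) {
        this.context = context;
        this.store = context.getStateStore("values-store"); // hypothetical store name
    }

    @Override
    public void process(final Record<String, String> record) {
        // Opening the iterator registers an "open iterator" metric (Metrics lock, first time),
        // and closing it removes the metric again (Metrics lock, second time) -- twice per record.
        try (KeyValueIterator<String, String> iter =
                 store.range(record.key(), record.key() + "\uFFFF")) {
            while (iter.hasNext()) {
                context.forward(record.withValue(iter.next().value));
            }
        }
    }
}
{code}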



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
