[
https://issues.apache.org/jira/browse/KAFKA-19678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18029605#comment-18029605
]
Steven Schlansker commented on KAFKA-19678:
-------------------------------------------
Thanks for the context. This does sound tricky :(
Unfortunately, some degenerate groups can have upwards of 0.5M entries (of at
least 16 bytes each), so I'm concerned the list approach would quickly run into
maximum-record-size problems, as well as expensive serialization and
deserialization costs.
For now, we run a patched Kafka client that intentionally leaks these metrics.
That is far from a long-term solution, but it at least keeps us running at the
moment.
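
To put rough numbers on the size concern, here is a back-of-the-envelope sketch.
It assumes the relevant limit is on the order of the broker default for
message.max.bytes (1048588 bytes); the actual limit depends on the deployment,
and the class and method names are made up for illustration.

```java
class OpenIteratorListSize {
    // Figures from the comment: roughly 0.5M entries of at least 16 bytes each.
    static final long ENTRIES = 500_000L;
    static final long MIN_BYTES_PER_ENTRY = 16L;
    // Assumption: broker default message.max.bytes; many clusters override it.
    static final long ASSUMED_MAX_RECORD_BYTES = 1_048_588L;

    // Lower bound on the payload, ignoring any framing or per-entry overhead.
    static long minPayloadBytes() {
        return ENTRIES * MIN_BYTES_PER_ENTRY;
    }

    static boolean exceedsAssumedLimit() {
        return minPayloadBytes() > ASSUMED_MAX_RECORD_BYTES;
    }

    public static void main(String[] args) {
        long payload = minPayloadBytes();
        System.out.printf("minimum payload: %d bytes, %.1fx the assumed limit%n",
                payload, (double) payload / ASSUMED_MAX_RECORD_BYTES);
    }
}
```

Under these assumptions the list alone is at least 8,000,000 bytes, several
times the assumed limit, before counting any serialization overhead.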
> Streams open iterator tracking has high contention on metrics lock
> ------------------------------------------------------------------
>
> Key: KAFKA-19678
> URL: https://issues.apache.org/jira/browse/KAFKA-19678
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 4.1.0
> Reporter: Steven Schlansker
> Priority: Major
> Attachments: image-2025-09-05-12-13-24-910.png
>
>
> We run Kafka Streams 4.1.0 with custom processors that heavily use state
> store range iterators.
> While investigating disappointing performance, we found a surprising source
> of lock contention.
> Over the course of a roughly 1-minute profiler sample, the
> {{org.apache.kafka.common.metrics.Metrics}} lock is taken approximately
> 40,000 times and blocks threads for about 1 minute.
> This appears to be because our state stores generally have no iterators open,
> except when their processor is processing a record, in which case it opens an
> iterator (taking the lock through {{OpenIterators.add}} into
> {{Metrics.registerMetric}}), does a tiny bit of work, and then closes the
> iterator (again taking the lock through {{OpenIterators.remove}} into
> {{Metrics.removeMetric}}).
> So, stream processing threads take a globally shared lock twice per record,
> for this subset of our data. I've attached a profiler thread state
> visualization with our findings: the red bars indicate where a thread was
> blocked on this lock during the sample. As you can see, this lock seems to be
> severely hampering our performance.
>
> !image-2025-09-05-12-13-24-910.png!
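
The access pattern described in the quoted report can be sketched roughly as
below. This is an illustrative model only: SharedRegistry is a hypothetical
stand-in for a registry guarded by a single lock, not the actual Kafka Metrics
class, and the metric and store names are invented.

```java
import java.util.HashMap;
import java.util.Map;

class IteratorPerRecordSketch {
    // Models one processor call: open an iterator, do a little work, close it.
    static void processRecord(SharedRegistry registry, String storeName) {
        String metricName = storeName + "-open-iterator"; // hypothetical metric name
        registry.register(metricName, System.nanoTime()); // iterator opened
        // ... a tiny bit of range-scan work would happen here ...
        registry.remove(metricName);                      // iterator closed
    }

    // Runs the pattern for a number of records and reports lock acquisitions.
    static long acquisitionsFor(int records) {
        SharedRegistry registry = new SharedRegistry();
        for (int i = 0; i < records; i++) {
            processRecord(registry, "store-" + (i % 4));
        }
        return registry.lockAcquisitions();
    }

    public static void main(String[] args) {
        System.out.println("lock acquisitions: " + acquisitionsFor(1_000));
    }
}

// Hypothetical registry where every mutation takes the same monitor lock.
class SharedRegistry {
    private final Map<String, Object> metrics = new HashMap<>();
    private long lockAcquisitions = 0;

    synchronized void register(String name, Object metric) {
        lockAcquisitions++;
        metrics.put(name, metric);
    }

    synchronized void remove(String name) {
        lockAcquisitions++;
        metrics.remove(name);
    }

    // Reading the counter also takes the lock but is not counted.
    synchronized long lockAcquisitions() {
        return lockAcquisitions;
    }
}
```

In this model every record costs two acquisitions of the shared lock, one on
open and one on close, so with many processing threads the lock becomes the
point where they serialize.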