[ https://issues.apache.org/jira/browse/KAFKA-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lucas Brutschy updated KAFKA-19529: ----------------------------------- Description: In long-running tests, streams sometimes fails with this exception: {code:java} /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz:[2025-06-27 02:06:22,384] ERROR [i-0285945bbb1aa2330-StreamThread-3] stream-client [i-0285945bbb1aa2330] Encountered the following exception during processing and sent shutdown request for the entire application. (org.apache.kafka.streams.KafkaStreams) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz-java.lang.IllegalArgumentException: A metric named 'MetricName [name=restore-records-rate, group=stream-state-updater-metrics, description=The average per-second number of records restored, tags={thread-id=i-0285945bbb1aa2330-StateUpdater-3}]' already exists, can't register another one. /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:344) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:317) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.DefaultStateUpdater$StateUpdaterMetrics.<init>(DefaultStateUpdater.java:1121) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.DefaultStateUpdater$StateUpdaterThread.<init>(DefaultStateUpdater.java:96) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.DefaultStateUpdater.start(DefaultStateUpdater.java:794) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.TaskManager.init(TaskManager.java:154) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:885) {code} * This seems to happen, when we restart a stream thread, e.g. the stream thread is removed at some point and then restarted at a later point. * We see log lines before / after the code that removes metrics from a state updater thread * Every time, it is a different state updater metric that seems to crash, so it seems to be some kind of race condition * I found a recent change in state updater initialization: [https://github.com/apache/kafka/pull/19889] . I have digged into it, and also asked Matthias/Bruno to dig into it, but we couldn’t find a reason why this could cause any problem. * What Matthias found out is that the problem seems to be on the removal-side. I.e., the metric indeed doesn't seem to be removed correctly when stopping the state updater. We have reverted PR #19889 in 4.1 and this seems to have resolved the problem there. The problem still persists on trunk. We should either revert #19889 on trunk as well, or fix the PR. cc [~cadonna] [~bbejeck] was: In long-running tests, streams sometimes fails with this exception: {code:java} /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz:[2025-06-27 02:06:22,384] ERROR [i-0285945bbb1aa2330-StreamThread-3] stream-client [i-0285945bbb1aa2330] Encountered the following exception during processing and sent shutdown request for the entire application. (org.apache.kafka.streams.KafkaStreams) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz-java.lang.IllegalArgumentException: A metric named 'MetricName [name=restore-records-rate, group=stream-state-updater-metrics, description=The average per-second number of records restored, tags={thread-id=i-0285945bbb1aa2330-StateUpdater-3}]' already exists, can't register another one. /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:344) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:317) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.DefaultStateUpdater$StateUpdaterMetrics.<init>(DefaultStateUpdater.java:1121) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.DefaultStateUpdater$StateUpdaterThread.<init>(DefaultStateUpdater.java:96) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.DefaultStateUpdater.start(DefaultStateUpdater.java:794) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.TaskManager.init(TaskManager.java:154) /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:885) {code} * This seems to happen, when we restart a stream thread, e.g. the stream thread is removed at some point and then restarted at a later point. * We see log lines before / after the code that removes metrics from a state updater thread * Every time, it is a different state updater metric that seems to crash, so it seems to be some kind of race condition * I found a recent change in state updater initialization: [https://github.com/apache/kafka/pull/19889] . I have digged into it, and also asked Matthias/Bruno to dig into it, but we couldn’t find a reason why this could cause any problem. We have reverted PR #19889 in 4.1 and this seems to have resolved the problem there. The problem still persists on trunk. We should either revert #19889 on trunk as well, or fix the PR. cc [~cadonna] [~bbejeck] > State updater does not always seem to remove all metrics > -------------------------------------------------------- > > Key: KAFKA-19529 > URL: https://issues.apache.org/jira/browse/KAFKA-19529 > Project: Kafka > Issue Type: Bug > Components: streams > Affects Versions: 4.2.0 > Reporter: Lucas Brutschy > Priority: Major > > In long-running tests, streams sometimes fails with this exception: > {code:java} > /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz:[2025-06-27 > 02:06:22,384] ERROR [i-0285945bbb1aa2330-StreamThread-3] stream-client > [i-0285945bbb1aa2330] Encountered the following exception during processing > and sent shutdown request for the entire application. > (org.apache.kafka.streams.KafkaStreams) > /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz-java.lang.IllegalArgumentException: > A metric named 'MetricName [name=restore-records-rate, > group=stream-state-updater-metrics, description=The average per-second number > of records restored, tags={thread-id=i-0285945bbb1aa2330-StateUpdater-3}]' > already exists, can't register another one. > /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at > org.apache.kafka.common.metrics.Sensor.add(Sensor.java:344) > /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at > org.apache.kafka.common.metrics.Sensor.add(Sensor.java:317) > /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at > org.apache.kafka.streams.processor.internals.DefaultStateUpdater$StateUpdaterMetrics.<init>(DefaultStateUpdater.java:1121) > /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at > org.apache.kafka.streams.processor.internals.DefaultStateUpdater$StateUpdaterThread.<init>(DefaultStateUpdater.java:96) > /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at > org.apache.kafka.streams.processor.internals.DefaultStateUpdater.start(DefaultStateUpdater.java:794) > /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at > org.apache.kafka.streams.processor.internals.TaskManager.init(TaskManager.java:154) > /mnt/run/streams/logs/streams.log.2025-06-27-02:59.gz- at > org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:885) > {code} > > * This seems to happen, when we restart a stream thread, e.g. the stream > thread is removed at some point and then restarted at a later point. > * We see log lines before / after the code that removes metrics from a state > updater thread > * Every time, it is a different state updater metric that seems to crash, so > it seems to be some kind of race condition > * I found a recent change in state updater initialization: > [https://github.com/apache/kafka/pull/19889] . I have digged into it, and > also asked Matthias/Bruno to dig into it, but we couldn’t find a reason why > this could cause any problem. > * What Matthias found out is that the problem seems to be on the > removal-side. I.e., the metric indeed doesn't seem to be removed correctly > when stopping the state updater. > We have reverted PR #19889 in 4.1 and this seems to have resolved the problem > there. The problem still persists on trunk. We should either revert #19889 on > trunk as well, or fix the PR. > > cc [~cadonna] [~bbejeck] -- This message was sent by Atlassian Jira (v8.20.10#820010)