Hello Kafka Team,

We are observing some unexpected behavior in the Java Kafka client:

Problem description:

When a KafkaShareConsumer fails to connect to a cluster (e.g. because a port is misconfigured), it enters a busy loop. The symptoms are excessive logging, high CPU usage, and a slowly increasing memory footprint.

Software Version:
We are using org.apache.kafka:kafka-clients:4.1.1.

Sample:
I created a repository with a minimal sample to reproduce the behavior: https://github.com/HenrikLueschenTNG/share-consumer-busy-loop/blob/main/src/main/java/com/example/shareconsumerbusyloop/ShareConsumerBusyLoopApplication.java
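
The essence of the sample is roughly the following (a minimal sketch, not a verbatim copy of the repository; the topic name and deserializers are placeholders, and port 9094 is deliberately one where no broker is listening):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaShareConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ShareConsumerBusyLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Deliberately misconfigured: no broker is listening on this port.
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9094");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaShareConsumer<String, String> consumer = new KafkaShareConsumer<>(props)) {
                consumer.subscribe(List.of("test-topic"));
                while (true) {
                    // The background ConsumerNetworkThread exhibits the busy
                    // loop even while this poll is blocked.
                    consumer.poll(Duration.ofSeconds(1));
                }
            }
        }
    }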

Details:
When the consumer fails to establish a connection, we first see a large number of identical log messages, often many of them published within the same millisecond:

2026-01-16 07:49:59.311 INFO [consumer_background_thread] org.apache.kafka.clients.Metadata - [ShareConsumer clientId=consumer-test-group-1, groupId=test-group] Rebootstrapping with [localhost/127.0.0.1:9094]
2026-01-16 07:49:59.311 INFO [consumer_background_thread] org.apache.kafka.clients.Metadata - [ShareConsumer clientId=consumer-test-group-1, groupId=test-group] Rebootstrapping with [localhost/127.0.0.1:9094]
2026-01-16 07:49:59.311 INFO [consumer_background_thread] org.apache.kafka.clients.Metadata - [ShareConsumer clientId=consumer-test-group-1, groupId=test-group] Rebootstrapping with [localhost/127.0.0.1:9094]

After a few seconds, these log messages stop, but the CPU usage remains very high.

I have done a little bit of digging and found the following:

- Within the loop of the ConsumerNetworkThread, several RequestManagers are used to determine the timeout for the next poll of the networkClientDelegate. The CoordinatorRequestManager frequently sets this timeout to zero. Its timeout is calculated as Math.max(0, backoffMs - timeSinceLastReceiveMs). Since the backoff is, by default, between 100ms and 1000ms while the request timeout is 30000ms, timeSinceLastReceiveMs almost always exceeds the backoff when no connection can be made, so the difference is negative and the timeout is clamped to zero (see the sketch after this list). I think this is causing the initial symptom of the many logs.

- After a few seconds, the client stops producing logs, but the CPU usage remains high. Additionally, a slow increase in memory usage can be observed. I believe this is due to an accumulation of applicationEvents in the ConsumerNetworkThread: within a few seconds, several million such events need to be (and cannot be) processed in the call to processApplicationEvents. This appears to slow down the loop in the ConsumerNetworkThread, resulting in fewer logs, while simultaneously keeping the CPU busy and using increasing amounts of memory.

- With a classic consumer, no such behavior can be observed.
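
To illustrate the first point with concrete numbers, here is a simplified sketch of the calculation (not the actual CoordinatorRequestManager code; the values are the client configuration defaults):

    public class ZeroTimeoutSketch {
        public static void main(String[] args) {
            // retry.backoff.max.ms defaults to 1000, so this is the largest
            // backoff the client will ever use.
            long backoffMs = 1000;
            // request.timeout.ms defaults to 30000: once a request to the
            // unreachable broker has timed out, the last received response
            // is at least this old.
            long timeSinceLastReceiveMs = 30_000;

            long timeout = Math.max(0, backoffMs - timeSinceLastReceiveMs);
            System.out.println(timeout); // 0 -> the next poll happens immediately
        }
    }

As long as no response arrives, timeSinceLastReceiveMs only grows, so the computed timeout stays pinned at zero and the loop never waits.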


Thanks in advance for any advice on this issue!
Best regards,
Henrik

