Filip created KAFKA-12731:
-----------------------------

             Summary: High number of rebalances lead to GC Overhead limit exceeded JVM crash
                 Key: KAFKA-12731
                 URL: https://issues.apache.org/jira/browse/KAFKA-12731
             Project: Kafka
          Issue Type: Bug
          Components: clients
    Affects Versions: 2.5.1
            Reporter: Filip
         Attachments: image-2021-04-29-15-39-12-608.png, image-2021-04-29-15-39-52-541.png, rebalancing.log

We have an application that uses Spring Cloud Stream, which delegates to {{kafka-clients:2.5.1}}. The application is started as follows:
{code:java}
java -jar -Xmx3072m -XX:+CrashOnOutOfMemoryError reporting-service.jar{code}
Normally, the application starts and joins its consumer group, which has a variable number of members. A rebalance occurs on startup and the partitions (of which there are *9* per topic) are assigned across all consumers in the group.

After starting two of these members within a short time of each other, we saw a large number of rebalances on both clients, which seems to have led to extremely high CPU usage on one of them. Ultimately, this led to a {{GC Overhead limit exceeded}} JVM crash, as the GC was unable to keep up and do meaningful work. I've included logs in the {{rebalancing.log}} file attached to this issue.

In our monitoring, we saw that the CPU usage for the container experiencing this issue shot up to over 300% (crashes occurred at 11:00 and 12:07):

!image-2021-04-29-15-39-52-541.png!

We have plenty of JVM metrics, but I am unsure which would be helpful in debugging this behaviour. There are a lot of components at play here:
 * org.apache.kafka.clients.consumer.internals.Fetcher
 * org.apache.kafka.clients.consumer.internals.SubscriptionState
 * org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
 * org.apache.kafka.clients.consumer.internals.AbstractCoordinator
 * org.springframework.cloud.stream.binder.kafka.KafkaMessageChannelBinder$1

It seems like some kind of memory leak is occurring that prevents the GC from reclaiming any meaningful amount of memory until the connection is stably established and the consumer has fully joined the group.

Any pointers as to where we could dig deeper would be much appreciated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)