Filip created KAFKA-12731:
-----------------------------
Summary: High number of rebalances leads to GC Overhead limit
exceeded JVM crash
Key: KAFKA-12731
URL: https://issues.apache.org/jira/browse/KAFKA-12731
Project: Kafka
Issue Type: Bug
Components: clients
Affects Versions: 2.5.1
Reporter: Filip
Attachments: image-2021-04-29-15-39-12-608.png,
image-2021-04-29-15-39-52-541.png, rebalancing.log
We have an application that uses Spring Cloud Stream which delegates to
{{kafka-clients:2.5.1}}.
The application is started as follows:
{code:java}
java -jar -Xmx3072m -XX:+CrashOnOutOfMemoryError reporting-service.jar{code}
Normally, the application starts and joins its consumer group, which has a
variable number of members. A rebalance occurs on startup and the partitions (of
which there are *9* per topic) get assigned across all consumers in the group.
After starting two of these members within a short time of each other, we saw
a large number of rebalances on both clients, which seems to have led to
extremely high CPU usage on one of them. Ultimately, this led to a
{{GC Overhead limit exceeded}} JVM crash, as the GC was unable to keep up and do
meaningful work.
I've included logs in the {{rebalancing.log}} file attached to this issue.
In our monitoring, we saw that the CPU usage for the container experiencing this
issue shot up to over 300% (crashes occurred at 11:00 and 12:07):
!image-2021-04-29-15-39-52-541.png!
We have plenty of JVM metrics but I am unsure which would be helpful in
debugging this behaviour.
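One metric that may be worth capturing alongside the CPU graph is the share of wall-clock time spent in GC. As far as I understand, with the default HotSpot settings {{GC Overhead limit exceeded}} is raised when the JVM spends roughly 98% of its time collecting while recovering less than 2% of the heap. Below is a minimal sketch using the standard {{java.lang.management}} API. The class name {{GcOverheadProbe}} and the one-second sampling window are illustrative assumptions and are not taken from the application:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadProbe {

    /** Sums the accumulated collection time (ms) reported by all collectors. */
    public static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of a wall-clock window spent in GC, clamped to [0, 1]. */
    public static double gcFraction(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        double fraction = (double) gcMillisDelta / wallMillisDelta;
        return Math.max(0.0, Math.min(1.0, fraction));
    }

    public static void main(String[] args) throws InterruptedException {
        long prevGc = totalGcTimeMillis();
        long prevWall = System.nanoTime();
        // Sample three one-second windows (hypothetical interval and count).
        for (int i = 0; i < 3; i++) {
            Thread.sleep(1000);
            long gc = totalGcTimeMillis();
            long wall = System.nanoTime();
            double fraction = gcFraction(gc - prevGc, (wall - prevWall) / 1_000_000);
            System.out.printf("GC time in last window: %.1f%%%n", fraction * 100);
            prevGc = gc;
            prevWall = wall;
        }
    }
}
```

If this value stays close to 100% during the rebalances, that would at least confirm that the CPU spike is GC activity rather than work done by the consumer threads themselves.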
There are a lot of components at play here:
* org.apache.kafka.clients.consumer.internals.Fetcher
* org.apache.kafka.clients.consumer.internals.SubscriptionState
* org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
* org.apache.kafka.clients.consumer.internals.AbstractCoordinator
* org.springframework.cloud.stream.binder.kafka.KafkaMessageChannelBinder$1
It seems like some kind of memory leak is occurring, which prevents the GC from
reclaiming any meaningful memory until the connection is stably established
and the consumer has fully joined the group.
Any pointers as to where we could look deeper would be much appreciated.