Filip created KAFKA-12731:
-----------------------------

             Summary: High number of rebalances lead to GC Overhead limit 
exceeded JVM crash
                 Key: KAFKA-12731
                 URL: https://issues.apache.org/jira/browse/KAFKA-12731
             Project: Kafka
          Issue Type: Bug
          Components: clients
    Affects Versions: 2.5.1
            Reporter: Filip
         Attachments: image-2021-04-29-15-39-12-608.png, 
image-2021-04-29-15-39-52-541.png, rebalancing.log

We have an application that uses Spring Cloud Stream, which delegates to 
{{kafka-clients:2.5.1}}.

The application is started as follows:
{code:bash}
java -jar -Xmx3072m -XX:+CrashOnOutOfMemoryError reporting-service.jar
{code}
Normally, the application starts and joins its consumer group, which has a 
variable number of members. A rebalance occurs on startup, and the partitions 
(of which there are *9* per topic) get assigned across all consumers in the 
group.
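
As a rough illustration of what such an assignment looks like for one topic, 
the sketch below splits 9 partitions as evenly as possible across a given 
number of members. It is a simplified model only, not the assignor that 
kafka-clients actually runs; the class name {{PartitionSplit}} and the method 
{{evenSplit}} are hypothetical.
{code:java}
import java.util.Arrays;

// Hypothetical illustration; not the actual kafka-clients assignor.
public class PartitionSplit {

    // Number of partitions each member owns when the split is as even as possible.
    public static int[] evenSplit(int partitions, int members) {
        if (members <= 0) {
            throw new IllegalArgumentException("members must be positive");
        }
        int[] counts = new int[members];
        for (int m = 0; m < members; m++) {
            // The first (partitions % members) members take one extra partition.
            counts[m] = partitions / members + (m < partitions % members ? 1 : 0);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 9 partitions per topic, as in our setup; the member count varies.
        for (int members = 1; members <= 4; members++) {
            System.out.println(members + " member(s): "
                    + Arrays.toString(evenSplit(9, members)));
        }
    }
}
{code}
Under this model, two members would own 5 and 4 partitions of each topic, and 
every change in membership moves the split again.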

After starting two of these members within a short time of each other, we saw 
a large number of rebalances on both clients, which seems to have led to 
extremely high CPU usage on one of them. Ultimately, this led to a 
{{GC Overhead limit exceeded}} JVM crash, as the GC was unable to keep up and 
do meaningful work.

I've included logs in the {{rebalancing.log}} file attached to this issue.

In our monitoring, we saw that the CPU usage for the container experiencing 
this issue shot up to over 300% (crashes occurred at 11:00 and 12:07):

!image-2021-04-29-15-39-52-541.png!

We have plenty of JVM metrics but I am unsure which would be helpful in 
debugging this behaviour.
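
One metric that may be relevant is the share of wall-clock time the JVM spends 
in GC, since this error is raised when nearly all time goes to collection while 
very little heap is reclaimed. A minimal sketch of sampling that share with the 
standard {{java.lang.management}} API is below. The class name 
{{GcOverheadSampler}}, the one-second interval and the three-sample loop are 
illustrative assumptions, not something taken from our application.
{code:java}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper; not part of the application or of kafka-clients.
public class GcOverheadSampler {

    // Accumulated GC time in milliseconds across all collectors since JVM start.
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // Fraction of an interval spent in GC, clamped to [0, 1].
    public static double overhead(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        double ratio = (double) gcMillisDelta / wallMillisDelta;
        return Math.max(0.0, Math.min(1.0, ratio));
    }

    public static void main(String[] args) throws InterruptedException {
        long lastGc = totalGcMillis();
        long lastWall = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) { // assumed: three one-second samples
            Thread.sleep(1000);
            long gcNow = totalGcMillis();
            long wallNow = System.currentTimeMillis();
            double pct = 100.0 * overhead(gcNow - lastGc, wallNow - lastWall);
            System.out.printf("GC overhead: %.1f%%%n", pct);
            lastGc = gcNow;
            lastWall = wallNow;
        }
    }
}
{code}
Values approaching 100% around the crash times would be consistent with the GC 
being unable to keep up.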

A lot of components are at play here:
 * org.apache.kafka.clients.consumer.internals.Fetcher
 * org.apache.kafka.clients.consumer.internals.SubscriptionState
 * org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
 * org.apache.kafka.clients.consumer.internals.AbstractCoordinator
 * org.springframework.cloud.stream.binder.kafka.KafkaMessageChannelBinder$1

It seems like some kind of memory leak is occurring, which prevents the GC from 
reclaiming any meaningful memory until the connection is stably established 
and the consumer has fully joined the group.

Any pointers as to where we could dig deeper would be much appreciated.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
