RivenSun created KAFKA-19427:
--------------------------------
Summary: Kafka 4.0 may have a memory leak, causing an OOM exception
Key: KAFKA-19427
URL: https://issues.apache.org/jira/browse/KAFKA-19427
Project: Kafka
Issue Type: Bug
Components: consumer, group-coordinator
Affects Versions: 4.0.0
Reporter: RivenSun
Attachments: image-2025-06-23-14-16-00-554.png,
image-2025-06-23-14-17-34-767.png, image-2025-06-23-14-28-51-524.png,
image-2025-06-23-14-31-47-453.png, image-2025-06-23-15-01-32-074.png,
image-2025-06-23-15-04-15-708.png, image-2025-06-23-15-04-26-598.png,
image-2025-06-23-15-11-13-026.png, image-2025-06-23-15-33-06-851.png,
image-2025-06-23-15-33-26-209.png
h3. Kafka cluster configuration
1. Kafka version: 4.0
2. Cluster specification: 3 brokers and 3 controllers
3. JVM startup parameters:
!image-2025-06-23-14-16-00-554.png!
4. JDK version:
!image-2025-06-23-14-17-34-767.png!
h3. Steps to reproduce the problem
1. In this new cluster, create a test topic, *test*; this cluster will ultimately contain *only this one topic*, used by external testers.
Topic config: NewTopic newTopic = new NewTopic("test", 3, (short) 1);
2. Start a producer and send 1,000 messages.
3. Start consumers with the earliest offset-reset strategy. The group IDs are rivenTest1/rivenTest2/.../rivenTest8 (see the reproduction sketch below this list).
4. While the consumers were starting, some consumer groups failed to start, and the coordinator brokers for those groups hit OOM exceptions.
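For reference, here is a minimal reproduction sketch of steps 1-3. It is only a sketch: the bootstrap address, serializers, payloads, and poll loop are assumptions, not the original ConsumerTest code, and the security settings of the real cluster are omitted.
{code:java}
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReproSketch {
    // Placeholder bootstrap address; the real cluster address is not part of this report.
    private static final String BOOTSTRAP = "localhost:9092";

    public static void main(String[] args) throws Exception {
        // Step 1: create the single test topic with 3 partitions and replication factor 1.
        try (Admin admin = Admin.create(props(null))) {
            admin.createTopics(List.of(new NewTopic("test", 3, (short) 1))).all().get();
        }

        // Step 2: send 1,000 messages.
        Properties producerProps = props(null);
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("test", "key-" + i, "value-" + i));
            }
        }

        // Step 3: start a consumer per group (rivenTest1 ... rivenTest8) with earliest offset reset.
        for (int g = 1; g <= 8; g++) {
            Properties consumerProps = props("rivenTest" + g);
            consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(List.of("test"));
                // A few polls are enough to drive the join/sync path shown in the logs below.
                for (int i = 0; i < 10; i++) {
                    consumer.poll(Duration.ofSeconds(1));
                }
            }
        }
    }

    private static Properties props(String groupId) {
        Properties p = new Properties();
        p.put("bootstrap.servers", BOOTSTRAP);
        if (groupId != null) {
            p.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        }
        return p;
    }
}
{code}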
Client error logs:
{code:java}
[main] INFO org.apache.kafka.common.telemetry.internals.KafkaMetricsCollector -
initializing Kafka metrics collector
[main] INFO org.apache.kafka.common.security.authenticator.AbstractLogin -
Successfully logged in.
[main] INFO org.apache.kafka.common.utils.AppInfoParser - Kafka version: 3.9.1
[main] INFO org.apache.kafka.common.utils.AppInfoParser - Kafka commitId:
f745dfdcee2b9851
[main] INFO org.apache.kafka.common.utils.AppInfoParser - Kafka startTimeMs:
1750661985923
[main] INFO org.apache.kafka.clients.consumer.internals.ClassicKafkaConsumer -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] Subscribed to
topic(s): test
[main] INFO org.apache.kafka.clients.Metadata - [Consumer
clientId=consumer-rivenTest6-1, groupId=rivenTest6] Cluster ID:
3esGOWhETi-zo2uHq7NsFg
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] Discovered group
coordinator 18-97-25-88-k.mq.zoomdev.us:9889 (id: 2147483644 rack: null)
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] (Re-)joining group
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] Request joining
group due to: need to re-join with the given member-id:
consumer-rivenTest6-1-38849218-32fa-430d-b14c-d3ce7ff402c4
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] (Re-)joining group
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] Successfully
joined group with generation Generation{generationId=17,
memberId='consumer-rivenTest6-1-38849218-32fa-430d-b14c-d3ce7ff402c4',
protocol='roundrobin'}
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] Finished
assignment for group at generation 17:
{consumer-rivenTest6-1-38849218-32fa-430d-b14c-d3ce7ff402c4=Assignment(partitions=[test-0,
test-1, test-2])}
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] Request joining
group due to: rebalance failed due to 'Unexpected error from SyncGroup: The
server experienced an unexpected error when processing the request.'
(KafkaException)
org.apache.kafka.common.KafkaException: Unexpected error from SyncGroup: The
server experienced an unexpected error when processing the request.
at
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:893)
at
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$SyncGroupResponseHandler.handle(AbstractCoordinator.java:812)
at
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1311)
at
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1286)
at
org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:206)
at
org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:169)
at
org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:129)
at
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:617)
at
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:429)
at
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:314)
at
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:253)
at
org.apache.kafka.clients.consumer.internals.ClassicKafkaConsumer.pollForFetches(ClassicKafkaConsumer.java:692)
at
org.apache.kafka.clients.consumer.internals.ClassicKafkaConsumer.poll(ClassicKafkaConsumer.java:623)
at
org.apache.kafka.clients.consumer.internals.ClassicKafkaConsumer.poll(ClassicKafkaConsumer.java:596)
at
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:874)
at us.zoom.mq.examples.ConsumerTest.startConsumer(ConsumerTest.java:233)
at us.zoom.mq.examples.ConsumerTest.main(ConsumerTest.java:149)
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] Member
consumer-rivenTest6-1-38849218-32fa-430d-b14c-d3ce7ff402c4 sending LeaveGroup
request to coordinator 18-97-25-88-k.mq.zoomdev.us:9889 (id: 2147483644 rack:
null) due to the consumer is being closed
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] Resetting
generation and member id due to: consumer pro-actively leaving the group
[main] INFO org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] Request joining
group due to: consumer pro-actively leaving the group
[main] ERROR org.apache.kafka.clients.consumer.internals.ConsumerCoordinator -
[Consumer clientId=consumer-rivenTest6-1, groupId=rivenTest6] LeaveGroup
request with Generation{generationId=17,
memberId='consumer-rivenTest6-1-38849218-32fa-430d-b14c-d3ce7ff402c4',
protocol='roundrobin'} failed with error: The server experienced an unexpected
error when processing the request.
[main] INFO org.apache.kafka.common.metrics.Metrics - Metrics scheduler closed
[main] INFO org.apache.kafka.common.metrics.Metrics - Closing reporter
org.apache.kafka.common.metrics.JmxReporter
[main] INFO org.apache.kafka.common.metrics.Metrics - Closing reporter
org.apache.kafka.common.telemetry.internals.ClientTelemetryReporter
[main] INFO org.apache.kafka.common.metrics.Metrics - Metrics reporters closed
[main] INFO org.apache.kafka.common.utils.AppInfoParser - App info
kafka.consumer for consumer-rivenTest6-1 unregistered {code}
Coordinator broker error logs:
!image-2025-06-23-15-04-26-598.png!
h3. Analysis:
This is a brand-new Kafka 4.0 cluster with only one topic created, running on JDK 17.
Why does the broker hit OOM so quickly when it has only just started producing and consuming data? Is there a memory leak somewhere?
1. First, use the arthas tool to analyze memory usage:
!image-2025-06-23-15-04-15-708.png!
We can see that most of the heap is *occupied by the old generation*, so the process is likely to hit a heap OOM as soon as it needs to {color:#FF0000}*allocate a large object. Note that the maximum heap we allocate to the Kafka process is only 3 GB, and there still appears to be plenty of free space in the heap. Why is a Java heap space OOM triggered directly in this case?*{color}
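To cross-check the arthas figures, the same heap and per-pool numbers can be read from the standard JMX memory beans. This is a generic sketch, not part of the original analysis; the "G1 Old Gen" pool name mentioned in the comment assumes the default G1 collector.
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class HeapCheck {
    public static void main(String[] args) {
        // Overall heap usage, comparable to the heap row reported by arthas.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);

        // Per-pool usage; with the default G1 collector the old generation pool
        // is named "G1 Old Gen". getMax() may be -1 if the pool has no defined limit.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            System.out.printf("%s used=%dMB max=%dMB%n",
                    pool.getName(), u.getUsed() >> 20, u.getMax() >> 20);
        }
    }
}
{code}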
2. Dump a heap snapshot and use analysis tools to find what is occupying so much memory in the process:
!image-2025-06-23-15-33-06-851.png!
!image-2025-06-23-15-33-26-209.png!
Analyzing the heap dump shows that nearly all of the retained memory is held by the *coordinators* field of the *CoordinatorRuntime* class and is never released. coordinators is a ConcurrentHashMap whose keys are of type TopicPartition and whose values are of type CoordinatorContext (a simplified model follows the screenshot below).
!image-2025-06-23-15-11-13-026.png!
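For readers not looking at the screenshots, below is a simplified, hypothetical model of the structure described above. The field and type names follow the report, but the class bodies are stand-ins rather than the real CoordinatorRuntime source; the point is that every entry added to the map keeps its whole CoordinatorContext object graph strongly reachable until it is explicitly removed.
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.common.TopicPartition;

// Simplified, hypothetical model of the structure seen in the heap dump;
// the real CoordinatorRuntime/CoordinatorContext classes hold much more state.
public class CoordinatorRuntimeModel {

    static class CoordinatorContext {
        // Stand-in for the per-partition coordinator state (group metadata,
        // timers, pending events, ...) that the dump shows being retained.
        final byte[] retainedState = new byte[8 * 1024 * 1024];
    }

    // Keyed by partition; in this model entries are never removed, so everything
    // a CoordinatorContext references stays strongly reachable and cannot be collected.
    private final ConcurrentHashMap<TopicPartition, CoordinatorContext> coordinators =
            new ConcurrentHashMap<>();

    void onLoad(TopicPartition tp) {
        coordinators.computeIfAbsent(tp, ignored -> new CoordinatorContext());
    }
}
{code}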
Why does the *coordinators* map in the broker process occupy such a large amount of memory, and never release it, when only a few consumers were started, the topic has only three partitions, and the consumer groups consume no more than 10 partitions in total?
Is there a problem with the broker configuration, the JDK 17 version, or the JVM startup parameters, or is there a memory leak in the Kafka 4.0 code?
Please help analyze this; I am looking forward to your reply.
Thank you very much!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)