[ https://issues.apache.org/jira/browse/KAFKA-5512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismael Juma updated KAFKA-5512:
-------------------------------
    Labels: performance  (was: )

> KafkaConsumer: High memory allocation rate when idle
> ----------------------------------------------------
>
>                 Key: KAFKA-5512
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5512
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.10.1.1
>            Reporter: Stephane Roset
>              Labels: performance
>             Fix For: 0.11.0.1
>
> Hi,
> We noticed in our application that the memory allocation rate increased
> significantly when we had no Kafka messages to consume. We isolated the
> issue by using a JVM that simply runs 128 Kafka consumers. These consumers
> consume 128 partitions (one partition per consumer). The partitions are
> empty and no message was sent during the test. The consumers were
> configured with default values (session.timeout.ms=30000,
> fetch.max.wait.ms=500, receive.buffer.bytes=65536,
> heartbeat.interval.ms=3000, max.poll.interval.ms=300000,
> max.poll.records=500). The Kafka cluster consisted of 3 brokers. In this
> setup, the allocation rate was about 55 MiB/s. Such a high allocation rate
> generates a lot of GC activity (to collect the young generation) and was a
> problem for our project.
> We profiled the JVM with JProfiler and noticed a huge quantity of
> ArrayList$Itr instances in memory. These collections were mainly
> instantiated by the methods handleCompletedReceives, handleCompletedSends,
> handleConnections and handleDisconnections of the class NetworkClient. We
> also noticed a very large number of calls to the method pollOnce of the
> class KafkaConsumer.
> So we decided to run only one consumer and profile the calls to pollOnce.
> We noticed that a huge number of calls is regularly made to this method,
> up to 268,000 calls within 100 ms. The pollOnce method calls the
> NetworkClient.handle* methods.
> These methods iterate over their collections (even if they are empty),
> which explains the huge number of iterators in memory. The large number of
> calls is related to the heartbeat mechanism. The pollOnce method
> calculates the poll timeout; if a heartbeat needs to be sent, the timeout
> is set to 0. The problem is that the heartbeat thread only checks every
> 100 ms (the default value of retry.backoff.ms) whether a heartbeat should
> be sent, so the KafkaConsumer calls the poll method in a loop without a
> timeout until the heartbeat thread wakes up. For example: the heartbeat
> thread has just started to wait and will wake up in 99 ms. During those
> 99 ms, the KafkaConsumer calls pollOnce in a loop with a timeout of 0.
> That explains how we can reach 268,000 calls within 100 ms.
> The heartbeat thread calls AbstractCoordinator.wait() to sleep, so I think
> the Kafka consumer should wake the heartbeat thread with a notify when
> needed.
> We made two quick fixes to solve this issue:
> - In NetworkClient.handle*(), we don't iterate over collections if they
> are empty (to avoid unnecessary iterator instantiations).
> - In KafkaConsumer.pollOnce(), if the poll timeout is equal to 0 we notify
> the heartbeat thread to wake it up (a dirty fix because we don't handle
> the autocommit case).
> With these 2 quick fixes and 128 consumers, the allocation rate drops from
> 55 MiB/s to 4 MiB/s.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
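The first quick fix described in the report, guarding the NetworkClient.handle* methods so they never instantiate an iterator over an empty collection, can be sketched roughly as follows. This is a minimal illustration, not the actual Kafka internals: the class name, method signature, and the simplified List<String> payload are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyGuardSketch {
    // Sketch of the first quick fix: skip iteration entirely when the
    // collection is empty, so no ArrayList$Itr object is allocated on
    // idle polls. Returns true only if an iterator was actually created.
    static boolean handleCompletedReceives(List<String> completedReceives) {
        if (completedReceives.isEmpty()) {
            return false; // nothing to do: avoid instantiating an iterator
        }
        for (String receive : completedReceives) {
            // process the response here (elided in this sketch)
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> empty = new ArrayList<>();
        List<String> one = new ArrayList<>();
        one.add("response");
        System.out.println(handleCompletedReceives(empty)); // false: no iterator
        System.out.println(handleCompletedReceives(one));   // true: iterated
    }
}
```

The guard matters here because an idle consumer calls these methods hundreds of thousands of times per second, so even a tiny per-call allocation dominates the young-generation churn.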
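The second quick fix, waking the sleeping heartbeat thread instead of busy-polling with a zero timeout, amounts to a standard wait/notify handshake on a shared monitor. A minimal sketch under that assumption (class and method names are hypothetical, not the actual AbstractCoordinator code):

```java
public class HeartbeatWakeupSketch {
    private final Object lock = new Object();

    // Heartbeat thread side: sleep up to retryBackoffMs (100 ms by
    // default) unless the consumer thread wakes us earlier.
    void awaitBackoff(long retryBackoffMs) throws InterruptedException {
        synchronized (lock) {
            lock.wait(retryBackoffMs);
        }
    }

    // Consumer thread side: when pollOnce computes a poll timeout of 0
    // because a heartbeat is due, notify the heartbeat thread instead of
    // spinning in a zero-timeout poll loop.
    void wakeHeartbeatThread() {
        synchronized (lock) {
            lock.notify();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatWakeupSketch sketch = new HeartbeatWakeupSketch();
        Thread heartbeat = new Thread(() -> {
            try {
                sketch.awaitBackoff(10_000); // would sleep 10 s if never notified
            } catch (InterruptedException ignored) {
            }
        });
        long start = System.nanoTime();
        heartbeat.start();
        Thread.sleep(50);              // give the heartbeat thread time to wait()
        sketch.wakeHeartbeatThread();  // consumer wakes it immediately
        heartbeat.join();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(elapsedMs < 10_000); // woken well before the timeout
    }
}
```

The design point is that the consumer thread already knows the exact moment a heartbeat becomes due (it computes a zero poll timeout), so an explicit notify removes the up-to-100-ms window during which it would otherwise spin.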