Stephane Roset created KAFKA-5512:
-------------------------------------

             Summary: KafkaConsumer: High memory allocation rate when idle
                 Key: KAFKA-5512
                 URL: https://issues.apache.org/jira/browse/KAFKA-5512
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 0.10.1.1
            Reporter: Stephane Roset


Hi,

We noticed in our application that the memory allocation rate increased 
significantly when we have no Kafka messages to consume. We isolated the issue 
by using a JVM that simply runs 128 Kafka consumers. These consumers consume 
128 partitions (so each consumer consumes one partition). The partitions are 
empty and no message has been sent during the test. The consumers were 
configured with default values (session.timeout.ms=30000, 
fetch.max.wait.ms=500, receive.buffer.bytes=65536, heartbeat.interval.ms=3000, 
max.poll.interval.ms=300000, max.poll.records=500). The Kafka cluster was made 
of 3 brokers. Within this context, the allocation rate was about 55 MiB/s. This 
high allocation rate generates a lot of GC activity (to garbage the young heap) 
and was an issue for our project.

We profiled the JVM with JProfiler. We noticed that there were a huge quantity 
of ArrayList$Itr in memory. These collections were mainly instantiated by the 
methods handleCompletedReceives, handleCompletedSends, handleConnecions and 
handleDisconnections of the class NetWorkClient. We also noticed that we had a 
lot of calls to the method pollOnce of the class KafkaConsumer. 

So we decided to run only one consumer and to profile the calls to the method 
pollOnce. We noticed that regularly a huge number of calls is made to this 
method, up to 268000 calls within 100ms. The pollOnce method calls the 
NetworkClient.handle* methods. These methods iterate on collections (even if 
they are empty), so that explains the huge number of iterators in memory.

The large number of calls is related to the heartbeat mechanism. The pollOnce 
method calculates the poll timeout; if a heartbeat needs to be done, the 
timeout will be set to 0. The problem is that the heartbeat thread checks every 
100 ms (default value of retry.backoff.ms) if a heartbeat should be sent, so 
the KafkaConsumer will call the poll method in a loop without timeout until the 
heartbeat thread awakes. For example: the heartbeat thread just started to wait 
and will awake in 99ms. So during 99ms, the KafkaConsumer will call in a loop 
the pollOnce method and will use a timeout of 0. That explains how we can have 
268000 calls within 100ms. 

The heartbeat thread calls the method AbstractCoordinator.wait() to sleep, so I 
think the Kafka consumer should awake the heartbeat thread with a notify when 
needed.

We made two quick fixes to solve this issue:
  - In NetworkClient.handle*(), we don't iterate on collections if they are 
empty (to avoid unnecessary iterators instantiations).
  - In KafkaConsumer.pollOnce(), if the poll timeout is equal to 0 we notify 
the heartbeat thread to awake it (dirty fix because we don't handle the 
autocommit case).

With these 2 quick fixes and 128 consumers, the allocation rate drops down from 
55 MiB/s to 4 MiB/s.








--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to