nick allen created KAFKA-8654:
---------------------------------

             Summary: Can't restart heartbeatThread if an unexpected 
exception is encountered in the heartbeat loop
                 Key: KAFKA-8654
                 URL: https://issues.apache.org/jira/browse/KAFKA-8654
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 2.1.0
            Reporter: nick allen


A consumer in our cluster has shown relatively high CPU usage for several 
days, caused by the Kafka poll thread. We dug in and found the cause: 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat
 returned zero, which led to a non-blocking select, which in turn made 
pollForFetches return immediately. Because the actual poll timeout is set to 
1s, pollForFetches was called thousands of times per poll, i.e. per second.
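
To make the mechanism concrete, here is a minimal sketch of the arithmetic, 
assuming the wait for one poll iteration is bounded by the smaller of the 
requested timeout and the time to the next heartbeat. HeartbeatTimerSketch 
and its methods are hypothetical stand-ins for illustration, not Kafka's 
actual classes; the timestamps are the ones from the memory dump in this 
report.

```java
// Hypothetical stand-in for the heartbeat timer; NOT Kafka's actual Timer class.
final class HeartbeatTimerSketch {
    private final long deadlineMs;

    HeartbeatTimerSketch(long deadlineMs) {
        this.deadlineMs = deadlineMs;
    }

    // Milliseconds until the deadline, clamped at zero once the deadline has passed.
    long remainingMs(long nowMs) {
        return Math.max(0L, deadlineMs - nowMs);
    }

    // Assumed rule: one poll iteration never waits longer than the time
    // remaining until the next heartbeat is due.
    static long effectivePollTimeoutMs(long requestedMs, long timeToNextHeartbeatMs) {
        return Math.min(requestedMs, timeToNextHeartbeatMs);
    }

    public static void main(String[] args) {
        // deadlineMs and currentTimeMs as observed in the dump.
        HeartbeatTimerSketch timer = new HeartbeatTimerSketch(1562075793801L);
        long nowMs = 1562823681506L;

        long waitMs = effectivePollTimeoutMs(1000L, timer.remainingMs(nowMs));
        // A stale deadline yields a zero wait, so the select never blocks.
        System.out.println("effective poll timeout ms = " + waitMs);
    }
}
```

With a deadline that is never renewed, every iteration computes a zero wait, 
so the 1s poll degenerates into a busy loop.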

We used a tool to inspect in-memory variables, which showed the following 
attributes of the corresponding heartbeatTimer:

@Timer[
 time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
 startMs=@Long[1562075783801], // Tue Jul 02 2019 13:56:23
 currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
 deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
 ]

This shows that no heartbeat has been sent for about 9 days, and jstack 
shows that the corresponding heartbeatThread is dead. Unfortunately we don't 
keep logs for that long, so I can't figure out what happened at the time.
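
The size of the gap can be computed directly from the dumped values. A small 
sketch using java.time.Duration follows; the class and method names are 
hypothetical, and only the two timestamps come from the dump.

```java
import java.time.Duration;

// Hypothetical helper for the arithmetic; not part of Kafka.
final class HeartbeatGapSketch {
    // How far past the heartbeat deadline the observed clock reading is.
    static Duration overdue(long currentTimeMs, long deadlineMs) {
        return Duration.ofMillis(currentTimeMs - deadlineMs);
    }

    public static void main(String[] args) {
        // currentTimeMs and deadlineMs as observed in the dump.
        Duration gap = overdue(1562823681506L, 1562075793801L);
        // prints "8 days, 15 hours overdue"
        System.out.println(gap.toDays() + " days, " + (gap.toHours() % 24) + " hours overdue");
    }
}
```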

IMO the heartbeatThread is too important to be left dead; there should at 
least be some way to revive it. But it seems that startHeartbeatThreadIfNeeded 
can only be triggered by a restart or by the heartbeat itself.
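
As an illustration of what "reviving" could look like, here is a generic 
sketch of a supervisor that starts a replacement thread whenever the previous 
one has died. It is only one possible shape under simplifying assumptions: 
RevivableWorkerSketch and all of its methods are hypothetical and do not 
reflect how AbstractCoordinator is implemented.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical supervisor for a background task; illustrative only, NOT Kafka code.
final class RevivableWorkerSketch {
    private final Runnable task;
    private final AtomicInteger starts = new AtomicInteger();
    private Thread thread;

    RevivableWorkerSketch(Runnable task) {
        this.task = task;
    }

    // Starts a thread if none has run yet or if the previous one has died.
    // Returns true when a new thread was started.
    synchronized boolean ensureRunning() {
        if (thread != null && thread.isAlive()) {
            return false;
        }
        thread = new Thread(task, "worker-" + starts.incrementAndGet());
        thread.setDaemon(true);
        // Report the failure briefly instead of dumping a full stack trace.
        thread.setUncaughtExceptionHandler(
            (t, e) -> System.err.println(t.getName() + " died: " + e));
        thread.start();
        return true;
    }

    // Blocks until the current thread (if any) has exited.
    void awaitExit() {
        Thread current;
        synchronized (this) {
            current = thread;
        }
        if (current == null) {
            return;
        }
        try {
            current.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    int startCount() {
        return starts.get();
    }

    public static void main(String[] args) {
        // The task simulates an unexpected failure that kills its thread.
        RevivableWorkerSketch worker = new RevivableWorkerSketch(() -> {
            throw new IllegalStateException("simulated unexpected failure");
        });
        worker.ensureRunning();
        worker.awaitExit();

        // A later check notices the dead thread and starts a replacement.
        boolean restarted = worker.ensureRunning();
        worker.awaitExit();
        System.out.println("restarted=" + restarted + ", starts=" + worker.startCount());
    }
}
```

In a real client the liveness check would have to run on a path that keeps 
executing after the thread dies, for example the caller's poll loop.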

It's also confusing that almost everything in 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run
 is asynchronous, so it seems impossible for any exception to be thrown 
there. Why, then, are there so many catch clauses?

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
