[ 
https://issues.apache.org/jira/browse/KAFKA-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nick allen updated KAFKA-8654:
------------------------------
    Description: 
A consumer in our cluster has shown relatively high CPU usage for several 
days, caused by the Kafka poll thread. We dug in and found that 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat
 returned zero, which led to a non-blocking select, which in turn made 
pollForFetches return immediately. The actual poll timeout is set to 1s, so 
pollForFetches was called thousands of times per poll, i.e. per second.
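
To make the chain of cause and effect concrete, here is a minimal sketch of the timeout arithmetic as I understand it. It is a simplified model, not the actual Kafka code: the class PollTimeoutSketch and both method names are made up for illustration, and it assumes the select timeout is the smaller of the poll timeout and the time to the next heartbeat.

```java
final class PollTimeoutSketch {
    /** Time left on a deadline-based timer; clamped so it is never negative. */
    static long remainingMs(long deadlineMs, long nowMs) {
        return Math.max(0L, deadlineMs - nowMs);
    }

    /** Assumption: the poll must not block past the next heartbeat, so take the minimum. */
    static long selectTimeoutMs(long pollTimeoutMs, long timeToNextHeartbeatMs) {
        return Math.min(pollTimeoutMs, timeToNextHeartbeatMs);
    }

    public static void main(String[] args) {
        long deadlineMs = 1562075793801L; // deadlineMs from the timer dump in this report
        long nowMs = 1562823681506L;      // currentTimeMs from the timer dump in this report
        long timeoutMs = selectTimeoutMs(1000L, remainingMs(deadlineMs, nowMs));
        // An expired heartbeat timer forces a zero timeout, i.e. a non-blocking select.
        System.out.println("select timeout = " + timeoutMs + " ms");
    }
}
```

As long as nothing resets the heartbeat timer, every call computes a zero timeout, which matches the busy loop we observed.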

We used a tool to inspect in-memory variables, which showed the following 
attributes of the corresponding heartbeatTimer:

@Timer[
 time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
 startMs=@Long[1562075783801], // Jul 02 2019 13:56:23
 currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
 deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
 ]
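
For reference, the gaps between these three values can be checked with a few lines of Java. This is only arithmetic on the numbers in the dump above; the class name HeartbeatTimerDump is made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

final class HeartbeatTimerDump {
    static final long START_MS = 1562075783801L;
    static final long CURRENT_MS = 1562823681506L;
    static final long DEADLINE_MS = 1562075793801L;

    /** Length of the timer as it was last armed (deadline minus start). */
    static Duration timerLength() {
        return Duration.ofMillis(DEADLINE_MS - START_MS);
    }

    /** How long the timer has been expired at the moment of the dump. */
    static Duration overdue() {
        return Duration.ofMillis(CURRENT_MS - DEADLINE_MS);
    }

    public static void main(String[] args) {
        System.out.println("start    = " + Instant.ofEpochMilli(START_MS));
        System.out.println("deadline = " + Instant.ofEpochMilli(DEADLINE_MS));
        System.out.println("current  = " + Instant.ofEpochMilli(CURRENT_MS));
        System.out.println("timer length = " + timerLength().getSeconds() + " s");
        System.out.println("overdue      = " + overdue().toDays() + " days, "
                + (overdue().toHours() % 24) + " hours");
    }
}
```

The timer was armed for 10 seconds and has been expired for 8 days and 15 hours, so it was never reset after 07-02 13:56.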

This shows that no heartbeat has been sent for almost 9 days, and *at 07-02 
13:56 we did restart the brokers*. jstack also shows that the corresponding 
heartbeatThread is dead. Unfortunately we don't keep logs for that long, so I 
can't figure out what happened at the time. 

IMO heartbeatThread is too important to be left dead. There should be at least 
some way to revive it, but it seems that startHeartbeatThreadIfNeeded can only 
be triggered by a restart or by the heartbeat itself.
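
To illustrate what I mean by reviving, here is a rough sketch of a start-if-needed check that also covers a thread that has died. It is not Kafka code and not a proposed patch: RevivableWorker and all of its methods are hypothetical names, and the only assumption is that the owner of the thread calls startIfNeeded() regularly, for example on every poll.

```java
final class RevivableWorker {
    private final Runnable task;
    private Thread thread;                  // guarded by "this"
    private int startCount;                 // guarded by "this"
    private volatile Throwable lastFailure; // last uncaught exception, if any

    RevivableWorker(Runnable task) {
        this.task = task;
    }

    /** Starts the worker if it was never started or has died. Returns true if a new thread was started. */
    synchronized boolean startIfNeeded() {
        if (thread != null && thread.isAlive()) {
            return false; // still running, nothing to do
        }
        startCount++;
        thread = new Thread(task, "revivable-worker-" + startCount);
        thread.setDaemon(true);
        // Record the failure instead of losing it when the thread dies.
        thread.setUncaughtExceptionHandler((t, e) -> lastFailure = e);
        thread.start();
        return true;
    }

    synchronized boolean isAlive() {
        return thread != null && thread.isAlive();
    }

    synchronized int startCount() {
        return startCount;
    }

    Throwable lastFailure() {
        return lastFailure;
    }

    /** Waits up to timeoutMs for the current thread to exit. Returns true if it is no longer alive. */
    boolean awaitDeath(long timeoutMs) {
        Thread t;
        synchronized (this) {
            t = thread;
        }
        if (t == null) {
            return true;
        }
        try {
            t.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return !t.isAlive();
    }
}
```

With a check like this, a worker killed by an unexpected exception would be replaced the next time its owner calls startIfNeeded(), and the recorded exception would still be available for logging.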

It's also confusing that almost everything in 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run
 is async, so it seems impossible for any exception to happen. Why, then, are 
there so many catch clauses?

 

  was:
There is a consumer in our cluster which has relatively high cpu usage for 
several days caused by kafka poll thread. So we dig in to find out that was 
because 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat
 returned zero leading to non-blocking select which in turn leading to 
pollForFetches returned immediately. But the actual poll timeout is set to 1s, 
so pollForFetches was called thousands of time per poll/second.

We use tool to inspect memory variables which show the corresponding 
heartbeatTimer's attribute:  

@Timer[
 time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
 startMs=@Long[1562075783801], // Jul 02 2019 13:56:23
 currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
 deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
 ]

That shows that heartbeat hasn't been happening for about 10 days, and at 07-02 
13:56 we did restarted brokers . And jstack shows the corresponding 
heartbeatThread is dead. Unfortunately we dont keep logs for that long so I 
cant figure out what happened then. 

IMO heartbeatThread is too important to be left dead, there should be at least 
some way to revive it, but it seems that startHeartbeatThreadIfNeeded can only 
be triggered by restarting or heartBeat itself.

It's also confusing that almost everything in 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run
 is async so it seems impossible for any exception to happen, so why is there 
so many catch clause?

 


> Cant restart heartbeatThread if encountered unexpected exception in 
> heartbeatloop.
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-8654
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8654
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 2.1.0
>            Reporter: nick allen
>            Priority: Major
>
> A consumer in our cluster has shown relatively high CPU usage for several 
> days, caused by the Kafka poll thread. We dug in and found that 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat
>  returned zero, which led to a non-blocking select, which in turn made 
> pollForFetches return immediately. The actual poll timeout is set to 1s, so 
> pollForFetches was called thousands of times per poll, i.e. per second.
> We used a tool to inspect in-memory variables, which showed the following 
> attributes of the corresponding heartbeatTimer:
> @Timer[
>  time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
>  startMs=@Long[1562075783801], // Jul 02 2019 13:56:23
>  currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
>  deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
>  ]
> This shows that no heartbeat has been sent for almost 9 days, and *at 07-02 
> 13:56 we did restart the brokers*. jstack also shows that the corresponding 
> heartbeatThread is dead. Unfortunately we don't keep logs for that long, so I 
> can't figure out what happened at the time. 
> IMO heartbeatThread is too important to be left dead. There should be at 
> least some way to revive it, but it seems that startHeartbeatThreadIfNeeded 
> can only be triggered by a restart or by the heartbeat itself.
> It's also confusing that almost everything in 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run
>  is async, so it seems impossible for any exception to happen. Why, then, are 
> there so many catch clauses?
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)