[ https://issues.apache.org/jira/browse/KAFKA-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
nick allen updated KAFKA-8654:
------------------------------
    Description:
A consumer in our cluster has shown relatively high CPU usage for several days, caused by the Kafka poll thread. We dug in and found that org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat was returning zero, which led to a non-blocking select, which in turn made pollForFetches return immediately. The actual poll timeout is set to 1s, so pollForFetches was being called thousands of times per poll (that is, per second).

We used a tool to inspect in-memory variables, which showed the following attributes of the corresponding heartbeatTimer:

@Timer[
 time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
 startMs=@Long[1562075783801], // Jul 02 2019 13:56:23
 currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
 deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
]

This shows that no heartbeat has happened for about 9 days, and we did restart the brokers at 07-02 13:56. jstack also shows that the corresponding heartbeatThread is dead. Unfortunately, we don't keep logs that long, so I can't figure out what happened at that time.

IMO heartbeatThread is too important to be left dead; there should be at least some way to revive it, but it seems that startHeartbeatThreadIfNeeded can only be triggered by restarting or by the heartbeat itself.

It's also confusing that almost everything in org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run is async, so it seems impossible for any exception to occur there. Why, then, are there so many catch clauses?

  was:
A consumer in our cluster has shown relatively high CPU usage for several days, caused by the Kafka poll thread. We dug in and found that org.apache.kafka.clients.consumer.internals.AbstractCoordinator#timeToNextHeartbeat was returning zero, which led to a non-blocking select, which in turn made pollForFetches return immediately.
But the actual poll timeout is set to 1s, so pollForFetches was being called thousands of times per poll (that is, per second).

We used a tool to inspect in-memory variables, which showed the following attributes of the corresponding heartbeatTimer:

@Timer[
 time=@SystemTime[org.apache.kafka.common.utils.SystemTime@4d806627],
 startMs=@Long[1562075783801], // Jul 02 2019 13:56:23
 currentTimeMs=@Long[1562823681506], // Thu Jul 11 2019 05:41:21
 deadlineMs=@Long[1562075793801], // Tue Jul 02 2019 13:56:33
]

This shows that no heartbeat has happened for about 9 days. jstack also shows that the corresponding heartbeatThread is dead. Unfortunately, we don't keep logs that long, so I can't figure out what happened at that time.

IMO heartbeatThread is too important to be left dead; there should be at least some way to revive it, but it seems that startHeartbeatThreadIfNeeded can only be triggered by restarting or by the heartbeat itself.

It's also confusing that almost everything in org.apache.kafka.clients.consumer.internals.AbstractCoordinator.HeartbeatThread#run is async, so it seems impossible for any exception to occur there. Why, then, are there so many catch clauses?

> Can't restart heartbeatThread if an unexpected exception is encountered in the heartbeat loop.
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-8654
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8654
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 2.1.0
>            Reporter: nick allen
>            Priority: Major
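
The busy-loop mechanism described in the report can be illustrated with a small standalone sketch. It does not use Kafka's classes: `ExpiredTimerDemo`, `remainingMs`, and `effectivePollTimeoutMs` are hypothetical names, and the rule that the blocking time is the smaller of the user's poll timeout and the time to the next heartbeat is an assumption taken from the report's description, not from the Kafka source. The sketch only shows that once the heartbeat deadline has passed and nothing resets the timer, the computed timeout stays at zero, so every poll returns immediately.

```java
// Standalone illustration; none of these names come from Kafka itself.
class ExpiredTimerDemo {

    // Time left until the deadline, clamped at zero once the deadline has passed.
    static long remainingMs(long deadlineMs, long nowMs) {
        return Math.max(0L, deadlineMs - nowMs);
    }

    // Assumed rule: a poll blocks for at most the user's timeout and
    // at most the time remaining until the next heartbeat is due.
    static long effectivePollTimeoutMs(long userTimeoutMs, long deadlineMs, long nowMs) {
        return Math.min(userTimeoutMs, remainingMs(deadlineMs, nowMs));
    }

    public static void main(String[] args) {
        // Values copied from the heartbeatTimer dump in the report.
        long startMs = 1562075783801L;
        long currentTimeMs = 1562823681506L;
        long deadlineMs = 1562075793801L;

        long timerDurationMs = deadlineMs - startMs;       // 10000
        long overdueMs = currentTimeMs - deadlineMs;       // 747887705
        long overdueWholeDays = overdueMs / 86_400_000L;   // 8

        System.out.println("timer duration (ms): " + timerDurationMs);
        System.out.println("overdue (ms): " + overdueMs);
        System.out.println("overdue (whole days): " + overdueWholeDays);

        // With a 1s user timeout the effective timeout is still 0,
        // because the heartbeat deadline is long past.
        System.out.println("effective poll timeout (ms): "
                + effectivePollTimeoutMs(1000L, deadlineMs, currentTimeMs));
    }
}
```

If the heartbeat thread were alive, it would reset the timer after each heartbeat and the remaining time would become positive again. With the thread dead, nothing moves the deadline forward, which is consistent with the sustained CPU usage described in the report.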