[ https://issues.apache.org/jira/browse/KAFKA-13615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491742#comment-17491742 ]

Guozhang Wang commented on KAFKA-13615:
---------------------------------------

[~timcosta] You mean your brokers are on MSK, right? That's fine: you can use 
a client version that is independent of your broker version, so if you manage 
the clients yourself you can still upgrade them.
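
As a quick sanity check of which client version an application actually runs 
(a minimal sketch; the version comes from the kafka-clients jar on the 
application classpath, not from the brokers it connects to):

{code:java}
import org.apache.kafka.common.utils.AppInfoParser;

public final class ClientVersionCheck {
    public static void main(String[] args) {
        // Reports the version of the kafka-clients jar on the application
        // classpath, independent of the MSK broker version.
        System.out.println("kafka-clients version: " + AppInfoParser.getVersion());
    }
}
{code}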

> Kafka Streams does not transition state on LeaveGroup due to poll interval 
> being exceeded
> -----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13615
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13615
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.8.1
>            Reporter: Tim Costa
>            Priority: Major
>
> We are running a KafkaStreams application with largely default settings. 
> Occasionally one of our consumers in the group takes too long between polls, 
> and Streams leaves the consumer group, but the state of the application 
> remains `RUNNING`. We are using the default `max.poll.interval.ms` of 300000.
> The process stays alive with no exception bubbling up to our code, so when 
> this occurs the app just sits there idle until a manual restart is performed.
> Here are the logs from around the time of the problem:
> {code:java}
> {"timestamp":"2022-01-24 19:56:44.404","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65296 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"}
> {"timestamp":"2022-01-24 19:58:44.478","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65284 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"}
> {"timestamp":"2022-01-24 20:03:50.383","level":"INFO","thread":"kafka-coordinator-heartbeat-thread | stage-us-1-fanout-logs-2c99","logger":"org.apache.kafka.clients.consumer.internals.AbstractCoordinator","message":"[Consumer clientId=kubepodname-StreamThread-1-consumer, groupId=stage-us-1-fanout-logs-2c99] Member kubepodname-StreamThread-1-consumer-283f0e0d-defa-4edf-88b2-39703f845db5 sending LeaveGroup request to coordinator b-2.***.kafka.us-east-1.amazonaws.com:9096 (id: 2147483645 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.","context":"default"}
> {code}
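> The two settings the log message points at are plain consumer configs that 
> Kafka Streams passes through to its embedded consumers. A minimal sketch of 
> overriding them (the values are illustrative only, not recommendations):
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.streams.StreamsConfig;
>
> public final class StreamsProps {
>     static Properties build() {
>         Properties props = new Properties();
>         props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stage-us-1-fanout-logs-2c99");
>         // Give the poll loop more headroom: up to 10 minutes between poll()
>         // calls (the consumer default is 300000 ms, i.e. 5 minutes)...
>         props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600000);
>         // ...and/or hand back smaller batches so each loop iteration finishes
>         // sooner (Kafka Streams overrides the consumer default to 1000).
>         props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
>         return props;
>     }
> }
> {code}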
> At this point the application entirely stops processing data. We initiated a 
> shutdown by deleting the Kubernetes pod, and the line printed by Kafka 
> immediately after the Spring Boot shutdown-initiation logs is the following:
> {code:java}
> {"timestamp":"2022-01-24 20:05:27.368","level":"INFO","thread":"kafka-streams-close-thread","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN","context":"default"}
> {code}
> For over a minute the application was in limbo: it had left the group, yet 
> it was still marked as being in a `RUNNING` state, so we had no way to 
> detect in an automated fashion that it had entered a bad state and needed to 
> be killed. While the above logs are from an app that we shut down manually 
> within a minute or two, we have seen this bad state persist for up to an 
> hour.
> It feels like a bug to me that the Streams consumer can leave the consumer 
> group without exiting the `RUNNING` state. I tried searching for other bugs 
> like this, but couldn't find any. Any ideas on how to detect this, or 
> thoughts on whether this is actually a bug?
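> One detection idea, as a sketch rather than a tested fix: the Java consumer 
> exposes a `last-poll-seconds-ago` metric (group `consumer-metrics`, present 
> since around 2.4), and Kafka Streams surfaces its embedded clients' metrics 
> through `KafkaStreams#metrics()`. A liveness probe could flag the instance 
> unhealthy when the gap since the last poll() exceeds `max.poll.interval.ms`, 
> even while `state()` still reports `RUNNING`. The class and method names 
> here are hypothetical:
> {code:java}
> import java.util.Map;
> import org.apache.kafka.common.Metric;
> import org.apache.kafka.common.MetricName;
> import org.apache.kafka.streams.KafkaStreams;
>
> public final class PollLivenessCheck {
>
>     // Hypothetical health-check helper: returns false if any embedded
>     // consumer has gone longer than maxPollIntervalMs since its last poll().
>     public static boolean isAlive(KafkaStreams streams, long maxPollIntervalMs) {
>         if (streams.state() != KafkaStreams.State.RUNNING) {
>             return false;
>         }
>         for (Map.Entry<MetricName, ? extends Metric> e : streams.metrics().entrySet()) {
>             MetricName name = e.getKey();
>             if ("last-poll-seconds-ago".equals(name.name())
>                     && "consumer-metrics".equals(name.group())) {
>                 double secondsAgo = (Double) e.getValue().metricValue();
>                 if (secondsAgo * 1000 > maxPollIntervalMs) {
>                     return false; // consumer has effectively left the group
>                 }
>             }
>         }
>         return true;
>     }
> }
> {code}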



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
