[ https://issues.apache.org/jira/browse/KAFKA-13615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491742#comment-17491742 ]
Guozhang Wang commented on KAFKA-13615:
---------------------------------------

[~timcosta] You mean your brokers are on MSK, right? That's fine, since you can use a client version that is independent of your broker version; if you manage the clients yourself, you can still upgrade them.

> Kafka Streams does not transition state on LeaveGroup due to poll interval being exceeded
> -----------------------------------------------------------------------------------------
>
> Key: KAFKA-13615
> URL: https://issues.apache.org/jira/browse/KAFKA-13615
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 2.8.1
> Reporter: Tim Costa
> Priority: Major
>
> We are running a KafkaStreams application with largely default settings. Occasionally one of our consumers in the group takes too long between polls, and streams leaves the consumer group but the state of the application remains `RUNNING`. We are using the default `max.poll.interval.ms` of 300000. The process stays alive with no exception that bubbles to our code, so when this occurs our app just kinda sits there idle until a manual restart is performed.
> Here are the logs from around the time of the problem:
> {code:java}
> {"timestamp":"2022-01-24 19:56:44.404","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65296 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"}
> {"timestamp":"2022-01-24 19:58:44.478","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65284 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"}
> {"timestamp":"2022-01-24 20:03:50.383","level":"INFO","thread":"kafka-coordinator-heartbeat-thread | stage-us-1-fanout-logs-2c99","logger":"org.apache.kafka.clients.consumer.internals.AbstractCoordinator","message":"[Consumer clientId=kubepodname-StreamThread-1-consumer, groupId=stage-us-1-fanout-logs-2c99] Member kubepodname-StreamThread-1-consumer-283f0e0d-defa-4edf-88b2-39703f845db5 sending LeaveGroup request to coordinator b-2.***.kafka.us-east-1.amazonaws.com:9096 (id: 2147483645 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.","context":"default"}
> {code}
> At this point the application entirely stops processing data.
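As a rough sanity check on the timeline in the logs above, the sketch below computes the gap between the last StreamThread summary line and the LeaveGroup line. It assumes the consumer default of 300000 ms for `max.poll.interval.ms`. The class and method names (`PollGap`, `gapMillis`) are hypothetical, and the summary line is only a proxy for the last `poll()` call, whose exact time is not logged.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Kafka: measures the time between two log lines.
final class PollGap {
    // Timestamp layout used by the JSON log lines quoted above.
    private static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");

    // Assumption: the consumer default for max.poll.interval.ms (5 minutes).
    static final long MAX_POLL_INTERVAL_MS = 300_000L;

    // Milliseconds elapsed from the earlier log timestamp to the later one.
    static long gapMillis(String earlier, String later) {
        return Duration.between(
                LocalDateTime.parse(earlier, LOG_TS),
                LocalDateTime.parse(later, LOG_TS)).toMillis();
    }

    public static void main(String[] args) {
        // Last summary line from the stream thread, then the LeaveGroup line.
        long gap = gapMillis("2022-01-24 19:58:44.478", "2022-01-24 20:03:50.383");
        System.out.println("gap ms = " + gap
                + ", exceeds assumed interval: " + (gap > MAX_POLL_INTERVAL_MS));
    }
}
```

The gap comes to 305905 ms, just over the assumed 300000 ms interval, which is consistent with the heartbeat thread sending LeaveGroup roughly five minutes after the stream thread last polled.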
> We initiated a shutdown by deleting the Kubernetes pod, and the line printed immediately by Kafka after the Spring Boot shutdown initiation logs is the following:
> {code:java}
> {"timestamp":"2022-01-24 20:05:27.368","level":"INFO","thread":"kafka-streams-close-thread","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN","context":"default"}
> {code}
> For a period of over a minute the application was in a state of hiatus where it had left the group; however, it was still marked as being in a `RUNNING` state, so we had no way to detect that the application had entered a bad state and kill it in an automated fashion. While the above logs are from an app that we shut down manually within a minute or two, we have seen this stay in a bad state for up to an hour before.
> It feels like a bug to me that the streams consumer can leave the consumer group but not exit the `RUNNING` state. I tried searching for other bugs like this, but couldn't find any. Any ideas on how to detect this, or thoughts on whether this is actually a bug?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)