[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939538#comment-16939538 ]
Will James edited comment on KAFKA-8950 at 9/27/19 3:24 PM:
------------------------------------------------------------

Here are the relevant logs from before one occurrence of the issue:

Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.consumer.KafkaConsumer [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Subscribed to topic(s): [redacted]
Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info org.apache.kafka.clients.Metadata [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Cluster ID: 5hzjksKUQJSFOa84oTBWdw
Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.AbstractCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Discovered group coordinator kafka-6-us-west-2.instana.io:9092 (id: 2147483646 rack: null)
Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Revoking previously assigned partitions []
Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.AbstractCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] (Re-)joining group
Sep 24 04:23:32 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.AbstractCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Successfully joined group with generation 2890
Sep 24 04:23:32 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Setting newly assigned partitions: [redacted]-0
Sep 24 04:23:32 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Setting offset for partition [redacted]-0 to the committed offset FetchPosition{offset=8999851092, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=kafka-13-us-west-2.instana.io:9092 (id: 8 rack: null), epoch=-1}}
Sep 24 04:23:33 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.SubscriptionState [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Seeking to LATEST offset of partition [redacted]-0
Sep 24 04:23:33 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.SubscriptionState [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Resetting offset for partition [redacted]-0 to offset 8999872779.

The consumer stopped consuming around 07:00 on the same day (although in some cases the error occurs days after the consumer starts). I don't see anything particularly interesting in these logs; the consumer is simply seeking to the end of the log after joining the group.
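For context, the consumer setup described in the report and in the logs above corresponds to a configuration roughly along these lines. This is a sketch, not the actual (redacted) configuration: the broker address comes from the logs, the group/client IDs are placeholders, the auto-commit interval is the client default, and the deserializers are assumptions. The "Seeking to LATEST offset" line suggests the application explicitly seeks to the end of the partition after assignment, rather than relying on the reset policy alone.

```
# Hypothetical consumer.properties reconstructing the reported setup.
# Placeholder values are marked <...>.
bootstrap.servers=kafka-6-us-west-2.instana.io:9092
group.id=<groupId>
client.id=<clientId>

# Auto-commit is enabled per the report; note that while the consumer is
# stuck, the same offset keeps being committed at this interval even
# though broker-side lag continues to grow.
enable.auto.commit=true
auto.commit.interval.ms=5000

# Assumed; the logs show a committed offset being restored and then an
# explicit seek to LATEST (e.g. via seekToEnd()) overriding it.
auto.offset.reset=latest

# Deserializers are assumptions; the report does not state the types.
key.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
```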
> KafkaConsumer stops fetching
> ----------------------------
>
>                 Key: KAFKA-8950
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8950
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.3.0
>         Environment: linux
>            Reporter: Will James
>            Priority: Major
>
> We have a KafkaConsumer consuming from a single partition with
> enable.auto.commit set to true.
>
> Very occasionally, the consumer goes into a broken state. It returns no
> records from the broker on every poll, and from most of the Kafka metrics
> in the consumer it looks fully caught up to the end of the log. We see
> that we are long polling for the max poll timeout, and that there is zero
> lag. In addition, the heartbeat rate stays unchanged from before the issue
> begins, so the consumer remains a member of the consumer group.
>
> However, from the __consumer_offsets topic it is possible to see that the
> consumer keeps committing the same offset on each auto-commit interval:
> the offset does not move, and the lag from the broker's perspective
> continues to increase.
>
> The issue is only resolved by restarting our application (which restarts
> the KafkaConsumer instance).
>
> From a heap dump of an application in this state, I can see that the
> Fetcher believes there are nodesWithPendingFetchRequests. I can also see
> the state of the fetch latency sensor, specifically the fetch rate, and
> its samples were not updated for a long period of time (precisely the
> amount of time the problem was occurring in our application, around 50
> hours; we have alerting on other metrics but not the fetch rate, so we
> didn't notice the problem until a customer complained).
>
> In this example, the consumer was processing around 40 messages per
> second with an average size of about 10 kB, although most of the other
> occurrences have happened at higher volume (250 messages per second,
> around 23 kB per message on average).
>
> I have spent some time investigating the issue on our end, and will
> continue to do so as time allows, but I wanted to raise this as an issue
> because it may be affecting other people.
>
> Please let me know if you have any questions or need additional
> information. I doubt I can provide heap dumps, unfortunately, but I can
> provide further information as needed.

--
This message was sent by Atlassian Jira (v8.3.4#803005)