[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950707#comment-16950707 ]
Tom Lee commented on KAFKA-8950: -------------------------------- Ah wait I misunderstood what you said above: so not trouble with the listener execution itself, but the updates to the nodesWithPendingFetchRequests map. I don't think it's a problem because Fetcher.send() is synchronized, and both onSuccess/onFailure are protected by synchronized blocks. Even if the future fires first, send() still holds the lock. > KafkaConsumer stops fetching > ---------------------------- > > Key: KAFKA-8950 > URL: https://issues.apache.org/jira/browse/KAFKA-8950 > Project: Kafka > Issue Type: Bug > Components: clients > Affects Versions: 2.3.0 > Environment: linux > Reporter: Will James > Priority: Major > > We have a KafkaConsumer consuming from a single partition with > enable.auto.commit set to true. > Very occasionally, the consumer goes into a broken state. It returns no > records from the broker with every poll, and from most of the Kafka metrics > in the consumer it looks like it is fully caught up to the end of the log. > We see that we are long polling for the max poll timeout, and that there is > zero lag. In addition, we see that the heartbeat rate stays unchanged from > before the issue begins (so the consumer stays a part of the consumer group). > In addition, from looking at the __consumer_offsets topic, it is possible to > see that the consumer is committing the same offset on the auto commit > interval, however, the offset does not move, and the lag from the broker's > perspective continues to increase. > The issue is only resolved by restarting our application (which restarts the > KafkaConsumer instance). > From a heap dump of an application in this state, I can see that the Fetcher > is in a state where it believes there are nodesWithPendingFetchRequests. > However, I can see the state of the fetch latency sensor, specifically, the > fetch rate, and see that the samples were not updated for a long period of > time (actually, precisely the amount of time that the problem in our > application was occurring, around 50 hours - we have alerting on other > metrics but not the fetch rate, so we didn't notice the problem until a > customer complained). > In this example, the consumer was processing around 40 messages per second, > with an average size of about 10kb, although most of the other examples of > this have happened with higher volume (250 messages / second, around 23kb per > message on average). > I have spent some time investigating the issue on our end, and will continue > to do so as time allows, however I wanted to raise this as an issue because > it may be affecting other people. > Please let me know if you have any questions or need additional information. > I doubt I can provide heap dumps unfortunately, but I can provide further > information as needed. -- This message was sent by Atlassian Jira (v8.3.4#803005)