[ https://issues.apache.org/jira/browse/KAFKA-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939538#comment-16939538 ]

Will James edited comment on KAFKA-8950 at 9/27/19 3:24 PM:
------------------------------------------------------------

Here are the relevant logs from before one occurrence of the issue.

Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.consumer.KafkaConsumer [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Subscribed to topic(s): [redacted]
Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info org.apache.kafka.clients.Metadata [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Cluster ID: 5hzjksKUQJSFOa84oTBWdw
Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.AbstractCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Discovered group coordinator kafka-6-us-west-2.instana.io:9092 (id: 2147483646 rack: null)
Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Revoking previously assigned partitions []
Sep 24 04:23:29 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.AbstractCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] (Re-)joining group
Sep 24 04:23:32 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.AbstractCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Successfully joined group with generation 2890
Sep 24 04:23:32 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Setting newly assigned partitions: [redacted]-0
Sep 24 04:23:32 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.ConsumerCoordinator [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Setting offset for partition [redacted]-0 to the committed offset FetchPosition{offset=8999851092, offsetEpoch=Optional.empty, currentLeader=LeaderAndEpoch{leader=kafka-13-us-west-2.instana.io:9092 (id: 8 rack: null), epoch=-1}}
Sep 24 04:23:33 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.SubscriptionState [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Seeking to LATEST offset of partition [redacted]-0
Sep 24 04:23:33 fleet-worker-highperf-17-us-west-2.instana.io [redacted] info o.a.k.c.c.i.SubscriptionState [Consumer clientId=[redacted]_[redacted], groupId=[redacted]] Resetting offset for partition [redacted]-0 to offset 8999872779.

The consumer stopped consuming around 07:00 on the same day (although in some 
cases the error occurs days after the consumer starts). I don't see anything 
particularly interesting here; the consumer is simply seeking to the end of 
the log.
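
For reference, a minimal sketch of the kind of consumer setup that would 
produce the log sequence above (subscribe, join the group, then seek to the 
latest offset once partitions are assigned). The topic, group, and client 
names are placeholders, and the processing loop is only illustrative:

    import java.time.Duration;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class SeekToEndConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");             // placeholder
            props.put(ConsumerConfig.CLIENT_ID_CONFIG, "example-client");           // placeholder
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

            KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singleton("example-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // offsets are auto-committed, nothing extra to do here
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // jump to the end of the log; this is what produces the
                    // "Seeking to LATEST offset" / "Resetting offset" lines above
                    consumer.seekToEnd(partitions);
                }
            });

            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(record -> {
                    // application-specific processing (placeholder)
                });
            }
        }
    }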


> KafkaConsumer stops fetching
> ----------------------------
>
>                 Key: KAFKA-8950
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8950
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 2.3.0
>         Environment: linux
>            Reporter: Will James
>            Priority: Major
>
> We have a KafkaConsumer consuming from a single partition with 
> enable.auto.commit set to true.
> Very occasionally, the consumer goes into a broken state. Every poll returns 
> no records from the broker, and most of the consumer's Kafka metrics make it 
> look fully caught up to the end of the log. We see that we are long polling 
> for the max poll timeout and that there is zero lag. In addition, the 
> heartbeat rate stays unchanged from before the issue begins (so the consumer 
> remains part of the consumer group).
> In addition, from looking at the __consumer_offsets topic, it is possible to 
> see that the consumer is committing the same offset at every auto-commit 
> interval; however, the offset does not move, and the lag from the broker's 
> perspective continues to increase.
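>
> One way to check this from the broker's side (a sketch only; the group, 
> topic, and bootstrap servers below are placeholders, not our real 
> configuration) is to compare the group's committed offset with the 
> partition's end offset:
>
>     import java.util.Collections;
>     import java.util.Properties;
>
>     import org.apache.kafka.clients.admin.AdminClient;
>     import org.apache.kafka.clients.admin.AdminClientConfig;
>     import org.apache.kafka.clients.consumer.ConsumerConfig;
>     import org.apache.kafka.clients.consumer.KafkaConsumer;
>     import org.apache.kafka.clients.consumer.OffsetAndMetadata;
>     import org.apache.kafka.common.TopicPartition;
>     import org.apache.kafka.common.serialization.ByteArrayDeserializer;
>
>     public class LagProbe {
>         public static void main(String[] args) throws Exception {
>             String bootstrap = "localhost:9092";                        // placeholder
>             TopicPartition tp = new TopicPartition("example-topic", 0); // placeholder
>
>             Properties adminProps = new Properties();
>             adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
>
>             Properties consumerProps = new Properties();
>             consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
>             consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
>             consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
>
>             try (AdminClient admin = AdminClient.create(adminProps);
>                  KafkaConsumer<byte[], byte[]> probe = new KafkaConsumer<>(consumerProps)) {
>                 // committed offset as stored in __consumer_offsets for the group
>                 OffsetAndMetadata committed = admin.listConsumerGroupOffsets("example-group") // placeholder group
>                         .partitionsToOffsetAndMetadata().get().get(tp);
>                 long committedOffset = committed == null ? -1L : committed.offset();
>                 // current log end offset on the broker
>                 long endOffset = probe.endOffsets(Collections.singleton(tp)).get(tp);
>                 System.out.printf("committed=%d end=%d lag=%d%n",
>                         committedOffset, endOffset, endOffset - committedOffset);
>             }
>         }
>     }
>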
> The issue is only resolved by restarting our application (which restarts the 
> KafkaConsumer instance).
> From a heap dump of an application in this state, I can see that the Fetcher 
> believes there are nodesWithPendingFetchRequests.
> However, I can also see the state of the fetch latency sensor, specifically 
> the fetch rate, and its samples were not updated for a long period of time - 
> in fact, precisely the period during which the problem was occurring in our 
> application, around 50 hours. We have alerting on other metrics but not on 
> the fetch rate, so we didn't notice the problem until a customer complained.
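>
> A sketch of how the fetch rate could be watched from inside the application 
> and fed into alerting. The metric group and name below are assumptions based 
> on what the consumer appears to register; they should be verified against 
> the keys actually returned by consumer.metrics():
>
>     import java.util.Map;
>
>     import org.apache.kafka.clients.consumer.KafkaConsumer;
>     import org.apache.kafka.common.Metric;
>     import org.apache.kafka.common.MetricName;
>
>     public final class FetchRateCheck {
>         // Returns the consumer's current fetch rate, or NaN if the metric is absent.
>         // "consumer-fetch-manager-metrics" / "fetch-rate" are assumed names; check
>         // them against the output of consumer.metrics() for your client version.
>         public static double fetchRate(KafkaConsumer<?, ?> consumer) {
>             for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
>                 MetricName name = entry.getKey();
>                 if ("consumer-fetch-manager-metrics".equals(name.group())
>                         && "fetch-rate".equals(name.name())) {
>                     return ((Number) entry.getValue().metricValue()).doubleValue();
>                 }
>             }
>             return Double.NaN;
>         }
>     }
>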
> In this example, the consumer was processing around 40 messages per second, 
> with an average size of about 10 KB, although most of the other examples of 
> this have happened at higher volume (250 messages per second, around 23 KB 
> per message on average).
> I have spent some time investigating the issue on our end, and will continue 
> to do so as time allows; however, I wanted to raise this as an issue because 
> it may be affecting other people.
> Please let me know if you have any questions or need additional information. 
> Unfortunately, I doubt I can provide heap dumps, but I can provide further 
> information as needed.


