[ 
https://issues.apache.org/jira/browse/KAFKA-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402348#comment-16402348
 ] 

Rob Gevers commented on KAFKA-6671:
-----------------------------------

[~hachikuji] We have had issues more consistently with a cluster that has been 
impacted by https://issues.apache.org/jira/browse/KAFKA-5413, which we are in 
the process of cleaning up. But we've also seen this issue occur in cases 
where there were only megabytes of offset data. Also, any amount of time 
without a consumer group coordinator seems too long, so are you saying that it 
is normal for a consumer group to have no coordinator at all while a broker 
starts up? The failover away from the original coordinator happens seamlessly, 
but when that original coordinator comes back online, group coordination 
becomes problematic.

I'm sure that our log cleaning challenges are making this worse, but it still 
seems like a problem that there is ever a period with *no* consumer group 
coordinator.
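
For context, here is a minimal sketch of how a group id maps to a partition of 
the __consumer_offsets topic. The broker that leads that partition acts as the 
group's coordinator, so coordination follows leadership of that partition. The 
sketch assumes the default offsets.topic.num.partitions of 50, and the class 
and method names are illustrative, not Kafka's actual code.

```java
final class GroupCoordinatorPartition {
    // Hypothetical helper (not part of Kafka's API): returns the index of the
    // __consumer_offsets partition that stores this group's metadata.
    static int partitionFor(String groupId, int offsetsTopicPartitions) {
        // Mask the sign bit so the result is non-negative even when
        // hashCode() is negative.
        return (groupId.hashCode() & 0x7fffffff) % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        int partitions = 50;              // assumed default of offsets.topic.num.partitions
        String groupId = "ConsumerGroup"; // group name as it appears in the traces
        System.out.println(groupId + " -> __consumer_offsets-"
                + partitionFor(groupId, partitions));
    }
}
```

Whichever broker leads the printed partition is the one that must finish 
loading the offsets data before it can answer as coordinator for the group.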

> Consumer group coordinator releases group before new coordinator is ready.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-6671
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6671
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>            Reporter: Rob Gevers
>            Priority: Major
>
> We regularly have an issue with our Kafka deploys which causes consumers to 
> be unable to consume for an extended period of time (up to an hour) after the 
> deploy finishes. The issue appears to be a side-effect of the way consumer 
> group coordination is managed between nodes. A sample timeline of a deploy 
> looks like the following:
> We initiate a clean shutdown of a node (which we will call kafka-2). We see 
> these traces:
> {noformat}
>  [2018-02-20 09:13:46,935] INFO [GroupCoordinator 1]: Loading group metadata 
> for ConsumerGroup with generation 3041 
> (kafka.coordinator.GroupCoordinator){noformat}
> {noformat}
>  [2018-02-20 09:13:47,788] INFO [GroupCoordinator 2]: Unloading group 
> metadata for ConsumerGroup with generation 3041{noformat}
> At this point kafka-2 is shut down and restarted successfully. Consumers 
> continue to function fine. Once kafka-2 is back online we see this trace from 
> kafka-1:
> {noformat}
>  [2018-02-20 09:49:30,486] INFO [GroupCoordinator 1]: Unloading group 
> metadata for ConsumerGroup with generation 3041{noformat}
> At this point the consumers go into a loop of "Discovered coordinator 
> Kafka-2" and "Marking the coordinator Kafka-2 dead". This preempts the 
> heartbeat timer and we even see the heartbeat rate metrics drop to 0. This 
> continues until kafka-2 has finished processing offset data and finally traces:
> {noformat}
>  [2018-02-20 10:52:28,956] INFO [GroupCoordinator 2]: Loading group metadata 
> for ConsumerGroup with generation 3041 
> (kafka.coordinator.GroupCoordinator){noformat}
> What seems like a bug to me is that kafka-1 is unloading the consumer group 
> long before kafka-2 is ready to load it. This seems to leave the group in an 
> unusable state, with offset commits failing because they are trying to commit 
> to kafka-2, but kafka-2 keeps responding that it isn't the group coordinator. 
> There is no coordinator for an hour.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
