Rob Gevers created KAFKA-6671:
---------------------------------

             Summary: Consumer group coordinator releases group before new 
coordinator is ready.
                 Key: KAFKA-6671
                 URL: https://issues.apache.org/jira/browse/KAFKA-6671
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.10.2.1
            Reporter: Rob Gevers


We regularly have an issue with our Kafka deploys which causes consumers to be 
unable to consume for an extended period of time (up to an hour) after the 
deploy finishes. The issue appears to be a side-effect of the way consumer 
group coordination is managed between nodes. A sample timeline of a deploy 
looks like the following:

We initiate a clean shutdown of a node (which we will call kafka-2). We see 
these traces:
{noformat}
 [2018-02-20 09:13:46,935] INFO [GroupCoordinator 1]: Loading group metadata 
for ConsumerGroup with generation 3041 
(kafka.coordinator.GroupCoordinator){noformat}
{noformat}
 [2018-02-20 09:13:47,788] INFO [GroupCoordinator 2]: Unloading group metadata 
for ConsumerGroup with generation 3041{noformat}
At this point kafka-2 is shut down and restarted successfully. Consumers 
continue to function fine. Once kafka-2 is back online we see this trace from 
kafka-1:
{noformat}
 [2018-02-20 09:49:30,486] INFO [GroupCoordinator 1]: Unloading group metadata 
for ConsumerGroup with generation 3041{noformat}
At this point the consumers go into a loop of "Discovered coordinator 
Kafka-2" followed by "Marking the coordinator Kafka-2 dead". This preempts the 
heartbeat timer, and we even see the heartbeat rate metrics drop to 0. This 
continues until kafka-2 has finished processing offset data and finally traces:
{noformat}
 [2018-02-20 10:52:28,956] INFO [GroupCoordinator 2]: Loading group metadata 
for ConsumerGroup with generation 3041 
(kafka.coordinator.GroupCoordinator){noformat}
What seems like a bug to me is that kafka-1 is unloading the consumer group 
long before kafka-2 is ready to load it. This seems to leave the group in an 
unusable state, with offset commits failing because they are trying to commit 
to kafka-2, but kafka-2 keeps responding that it isn't the group coordinator. 
As a result, the group has no coordinator for roughly an hour.
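
To put a number on that window, the gap between the kafka-1 unload trace and the kafka-2 load trace can be computed directly from the two log lines quoted above. This is only a minimal sketch: the helper name `parse_ts` and the timestamp format string are my own assumptions based on how the traces look, not anything provided by Kafka.

```python
from datetime import datetime

# Assumed timestamp layout, inferred from the traces quoted in this report.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(line):
    """Hypothetical helper: read the timestamp between the first '[' and ']'."""
    start = line.index("[") + 1
    end = line.index("]")
    return datetime.strptime(line[start:end], LOG_TS_FORMAT)


# The two traces from the timeline, joined onto single lines.
unload_on_kafka1 = (
    "[2018-02-20 09:49:30,486] INFO [GroupCoordinator 1]: Unloading group "
    "metadata for ConsumerGroup with generation 3041"
)
load_on_kafka2 = (
    "[2018-02-20 10:52:28,956] INFO [GroupCoordinator 2]: Loading group "
    "metadata for ConsumerGroup with generation 3041 "
    "(kafka.coordinator.GroupCoordinator)"
)

# Time during which neither broker has the group loaded.
gap = parse_ts(load_on_kafka2) - parse_ts(unload_on_kafka1)
print(gap)  # prints 1:02:58.470000
```

For this deploy that works out to just under 63 minutes between kafka-1 releasing the group and kafka-2 loading it.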



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
