William Reynolds created KAFKA-10105:
----------------------------------------

             Summary: Regression in group coordinator dealing with flaky 
clients joining while leaving
                 Key: KAFKA-10105
                 URL: https://issues.apache.org/jira/browse/KAFKA-10105
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 2.4.1
         Environment: Kafka 1.1.0 on jre 8 on debian 9 in docker
Kafka 2.4.1 on jre 11 on debian 9 in docker
            Reporter: William Reynolds


Since upgrading a cluster from 1.1.0 to 2.4.1, the broker no longer correctly 
handles a consumer that sends a join after a leave.

What happens now is that if a consumer sends a leave and then, while it is 
shutting down, tries to send a join again, the group coordinator adds the 
leaving member to the group but never seems to heartbeat that member.

Since that consumer is then gone, when it starts up and joins again it is added 
as a new member, but the zombie member is still there and is included in the 
partition assignment, which means that those partitions never get consumed. 
What can also happen is that one of the zombies becomes group leader, so the 
rebalance gets stuck forever and the group is entirely blocked.
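
To illustrate why a zombie member stalls partitions, here is a toy model in 
Python. It is only a sketch under stated assumptions: the round-robin function 
is a simplified stand-in and not Kafka's actual assignor, and the member ids 
and partition count are hypothetical.

```python
from typing import Dict, List, Set


def round_robin_assign(members: List[str], partitions: List[int]) -> Dict[str, List[int]]:
    """Toy round-robin assignment over members in sorted order.

    Simplified stand-in for illustration, not Kafka's real assignor.
    """
    ordered = sorted(members)
    assignment: Dict[str, List[int]] = {m: [] for m in ordered}
    for i, partition in enumerate(partitions):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


def unconsumed_partitions(assignment: Dict[str, List[int]], live: Set[str]) -> List[int]:
    """Partitions assigned to members that have no running consumer behind them."""
    return sorted(
        p for member, parts in assignment.items() if member not in live for p in parts
    )


# Hypothetical group state: consumer-1 restarted and rejoined under a new
# member id, while its old (zombie) member id is still listed in the group.
group_members = ["consumer-1-new", "consumer-1-zombie", "consumer-2"]
live_members = {"consumer-1-new", "consumer-2"}

assignment = round_robin_assign(group_members, list(range(6)))
stalled = unconsumed_partitions(assignment, live_members)

print(assignment)
print(stalled)  # partitions nobody is actually reading
```

In this model the zombie is handed partitions 1 and 4, and since no process 
backs that member id, nothing ever reads them.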

I have not been able to track down where this was introduced between 1.1.0 and 
2.4.1, but I will look further into this. Unfortunately the logs are essentially 
silent about the zombie members, and I only had INFO level logging on during the 
issue. By stopping all the consumers in the group and restarting the broker 
coordinating that group, we could get back to a working state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
