William Reynolds created KAFKA-10105:
----------------------------------------
Summary: Regression in group coordinator dealing with flaky
clients joining while leaving
Key: KAFKA-10105
URL: https://issues.apache.org/jira/browse/KAFKA-10105
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 2.4.1
Environment: Kafka 1.1.0 on jre 8 on debian 9 in docker
Kafka 2.4.1 on jre 11 on debian 9 in docker
Reporter: William Reynolds
Since upgrading a cluster from 1.1.0 to 2.4.1, the broker no longer correctly
handles a consumer that sends a join after a leave.
What happens now is that if a consumer sends a leave and then tries to send a
join again while it is shutting down, the group coordinator adds the leaving
member to the group but never seems to heartbeat that member.
Since that consumer is then gone, when it joins again after restarting it is
added as a new member, but the zombie member is still there and is included in
the partition assignment, which means that those partitions never get consumed.
What can also happen is that one of the zombies becomes group leader, so the
rebalance gets stuck forever and the group is entirely blocked.
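To illustrate the effect described above, here is a toy model in Python. It is not Kafka's coordinator or assignor code: the member IDs are hypothetical and the round-robin assignment is a simplifying assumption. It only shows how a member that stays registered but no longer exists still receives partitions that nobody then reads.

```python
from collections import defaultdict


def round_robin_assign(members, partitions):
    """Toy assignor: hand out partitions to members in sorted order."""
    ordered = sorted(members)
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[ordered[i % len(ordered)]].append(partition)
    return dict(assignment)


def unconsumed_partitions(assignment, live_members):
    """Partitions assigned to members that have no running consumer."""
    return sorted(
        partition
        for member, owned in assignment.items()
        if member not in live_members
        for partition in owned
    )


# Hypothetical member IDs: the restarted consumer rejoined under a new ID,
# while the ID created by its join-during-shutdown was never removed.
group_members = {"consumer-1-new", "consumer-1-zombie", "consumer-2"}
live_members = {"consumer-1-new", "consumer-2"}

assignment = round_robin_assign(group_members, list(range(6)))
stalled = unconsumed_partitions(assignment, live_members)
print(assignment)
print("partitions with no live consumer:", stalled)
```

In this model, one third of the partitions end up owned by the zombie, matching the symptom that some partitions are never consumed even though every real consumer is healthy.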
I have not been able to track down where this was introduced between 1.1.0 and
2.4.1, but I will look into it further. Unfortunately the logs are essentially
silent about the zombie members, and I only had INFO level logging on during
the issue. By stopping all the consumers in the group and restarting the broker
coordinating that group, we were able to get back to a working state.