[ https://issues.apache.org/jira/browse/KAFKA-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132599#comment-17132599 ]
Brian McKelvey commented on KAFKA-10105:
----------------------------------------

Yes, this is a big problem in a few ways:

1.) If a consumer process is forcefully killed without sending a `leave_group` call, it remains in the consumer group on the broker, despite missed heartbeats.

2.) If a consumer sends a `join_group` call, but that call runs into the client's socket timeout and is retried, the group member created by the first call remains stuck as a zombie in the group, again despite missing heartbeats.

This has happened a lot for us when using Zendesk's ruby-kafka client library, which handles most Kafka interactions on the main thread. If an active ruby-kafka consumer is chosen as the group leader while it is in the middle of processing a previously fetched message, and it takes too long to get back to the group membership operations (heartbeat, `sync_group`, etc.), it can cause the `join_group` call for a new member to time out. (The default socket timeout in ruby-kafka is 10 seconds.)

cc: [~dasch]

> Regression in group coordinator dealing with flaky clients joining while
> leaving
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-10105
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10105
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>         Environment: Kafka 1.1.0 on jre 8 on debian 9 in docker
> Kafka 2.4.1 on jre 11 on debian 9 in docker
>            Reporter: William Reynolds
>            Priority: Major
>
> Since the upgrade of a cluster from 1.1.0 to 2.4.1, the broker no longer correctly handles a consumer sending a join after a leave.
> What happens now is that if a consumer sends a leave and then, while it is shutting down, follows up by trying to send a join again, the group coordinator adds the leaving member to the group but never seems to heartbeat that member.
> Since the consumer is then gone, when it joins again after starting it is added as a new member, but the zombie member is still there and is included in the partition assignment, which means that those partitions never get consumed from. What can also happen is that one of the zombies becomes group leader, so the rebalance gets stuck forever and the group is entirely blocked.
> I have not been able to track down where this got introduced between 1.1.0 and 2.4.1, but I will look further into this. Unfortunately the logs are essentially silent about the zombie members, and I only had INFO level logging on during the issue. By stopping all the consumers in the group and restarting the broker coordinating that group we could get back to a working state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
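
For anyone trying to reason about scenario 2.) from the comment, here is a rough toy model in Python. It is not Kafka's coordinator code and not ruby-kafka's API: `ToyCoordinator`, `simulate`, and the 30 second session timeout are all made up for illustration. Only the 10 second socket timeout comes from the comment. It shows a retried `join_group` leaving a second member behind, and how that member either expires on missed heartbeats (the expected behavior) or lingers as a zombie (the reported behavior).

```python
import itertools


class ToyCoordinator:
    """Toy model of a group coordinator; NOT Kafka's actual implementation."""

    def __init__(self, session_timeout_s, expire_missed_heartbeats=True):
        self.session_timeout_s = session_timeout_s  # assumed value, see simulate()
        self.expire_missed_heartbeats = expire_missed_heartbeats
        self.last_heartbeat = {}  # member_id -> time of last heartbeat (seconds)
        self._ids = itertools.count(1)

    def join_group(self, now):
        # A join without a known member id registers a brand-new member.
        member_id = f"member-{next(self._ids)}"
        self.last_heartbeat[member_id] = now
        return member_id

    def heartbeat(self, member_id, now):
        self.last_heartbeat[member_id] = now

    def members(self, now):
        # Expected behavior: drop members whose last heartbeat is too old.
        if self.expire_missed_heartbeats:
            self.last_heartbeat = {
                m: t
                for m, t in self.last_heartbeat.items()
                if now - t <= self.session_timeout_s
            }
        return sorted(self.last_heartbeat)


def simulate(expire_missed_heartbeats):
    # Hypothetical 30 s session timeout, chosen only for this sketch.
    coord = ToyCoordinator(
        session_timeout_s=30,
        expire_missed_heartbeats=expire_missed_heartbeats,
    )
    # First join_group: the member is registered, but the client's socket
    # timeout (10 s by default in ruby-kafka) fires before the reply arrives,
    # so the client never learns this member id.
    coord.join_group(now=0)
    # The client retries and is registered again under a second member id.
    live = coord.join_group(now=10)
    # Only the second member ever sends heartbeats.
    for t in range(13, 101, 3):
        coord.heartbeat(live, now=t)
    return coord.members(now=100)


if __name__ == "__main__":
    print("with expiry:   ", simulate(True))
    print("without expiry:", simulate(False))
```

With expiry enabled only the live member remains at the end of the run. With it disabled the first, never-heartbeating member stays in the group, which is the zombie that ends up in the partition assignment.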