[jira] [Commented] (KAFKA-10105) Regression in group coordinator dealing with flaky clients joining while leaving

Brian McKelvey (Jira) Wed, 10 Jun 2020 11:26:20 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132599#comment-17132599
 ]


Brian McKelvey commented on KAFKA-10105:
----------------------------------------

Yes, this is a big problem in a few ways:

1.) If a consumer process is forcefully killed without sending a `leave_group`
request, the broker keeps it in the consumer group even though it no longer
sends heartbeats.

2.) If a consumer sends a `join_group` request that runs into the client's
socket timeout and is retried, the group member created by the first request
remains stuck in the group as a zombie, again despite missing heartbeats.
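
For illustration, the behaviour one would expect in both cases can be sketched
as a toy model in Python. This is not Kafka's coordinator code:
`ToyCoordinator`, its methods, the member names, and the 30-second session
timeout are all hypothetical. The sketch only shows that a member that stops
heartbeating, whether it was killed or orphaned by a retried `join_group`,
should be evicted once the session timeout has passed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCoordinator:
    """Toy model of a group coordinator; NOT Kafka's actual implementation."""

    session_timeout_s: float
    # member_id -> time of the last sign of life (join or heartbeat)
    last_heartbeat: dict = field(default_factory=dict)

    def join(self, member_id: str, now: float) -> None:
        # Joining counts as the member's first sign of life.
        self.last_heartbeat[member_id] = now

    def heartbeat(self, member_id: str, now: float) -> None:
        # Heartbeats from unknown members are ignored in this toy model.
        if member_id in self.last_heartbeat:
            self.last_heartbeat[member_id] = now

    def expire(self, now: float) -> list:
        # Evict every member whose last heartbeat is older than the timeout.
        dead = sorted(
            member
            for member, seen in self.last_heartbeat.items()
            if now - seen > self.session_timeout_s
        )
        for member in dead:
            del self.last_heartbeat[member]
        return dead


coordinator = ToyCoordinator(session_timeout_s=30.0)  # hypothetical value
coordinator.join("killed-consumer", now=0.0)   # case 1: killed, no leave_group
coordinator.join("timed-out-join", now=0.0)    # case 2: abandoned first join
coordinator.join("healthy-consumer", now=0.0)
coordinator.heartbeat("healthy-consumer", now=25.0)

evicted = coordinator.expire(now=40.0)
print(evicted)  # ['killed-consumer', 'timed-out-join']
```

In the situation described above, the two silent members are never evicted,
so they keep receiving partition assignments.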

This has happened a lot for us with Zendesk's ruby-kafka client library,
which handles most Kafka interactions on the main thread. If an active
ruby-kafka consumer has been chosen as the group leader and is in the middle
of processing a previously fetched message, and it takes too long to get back
to the operations related to group membership (heartbeat, `sync_group`, etc.),
the `join_group` call for a new member can time out. (The default socket
timeout in ruby-kafka is 10 seconds.)
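
A rough back-of-the-envelope model of that timing, as a sketch only: it
assumes the coordinator cannot answer a new member's `join_group` until the
busy leader rejoins, and that the client retries immediately after each socket
timeout. The function names and the 25-second processing time are made up for
illustration; only the 10-second timeout comes from the paragraph above.

```python
import math


def join_attempts(leader_busy_s: float, socket_timeout_s: float) -> int:
    """Number of join_group requests sent before one can complete.

    Toy model: the response is held back while the leader is busy, and
    the client retries as soon as a request hits the socket timeout.
    """
    if leader_busy_s < socket_timeout_s:
        return 1
    # Each full timeout window that elapses while the leader is busy
    # costs one abandoned request; the following request then completes.
    return math.floor(leader_busy_s / socket_timeout_s) + 1


def abandoned_requests(leader_busy_s: float, socket_timeout_s: float) -> int:
    # Every attempt except the last was given up on by the client. If the
    # broker created a member for each of them, these are the zombies.
    return join_attempts(leader_busy_s, socket_timeout_s) - 1


# Leader spends 25 s on one message (hypothetical), 10 s socket timeout.
print(join_attempts(25.0, 10.0))       # 3
print(abandoned_requests(25.0, 10.0))  # 2
```

Under these assumptions, a single slow message on the leader is enough to
leave more than one stale member behind for one joining consumer.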

cc: [~dasch]

> Regression in group coordinator dealing with flaky clients joining while 
> leaving
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-10105
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10105
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>         Environment: Kafka 1.1.0 on jre 8 on debian 9 in docker
> Kafka 2.4.1 on jre 11 on debian 9 in docker
>            Reporter: William Reynolds
>            Priority: Major
>
> Since upgrading a cluster from 1.1.0 to 2.4.1, the broker no longer deals
> correctly with a consumer sending a join after a leave.
> What happens now is that if a consumer sends a leave and then follows up by
> trying to send a join again as it is shutting down, the group coordinator adds
> the leaving member to the group but never seems to heartbeat that member.
> Since the consumer is then gone, when it joins again after starting it is
> added as a new member, but the zombie member is still there and is included
> in the partition assignment, which means that those partitions never get
> consumed from. What can also happen is that one of the zombies becomes group
> leader, so the rebalance gets stuck forever and the group is entirely blocked.
> I have not been able to track down where this got introduced between 1.1.0
> and 2.4.1, but I will look further into this. Unfortunately the logs are
> essentially silent about the zombie members, and I only had INFO level
> logging on during the issue. By stopping all the consumers in the group and
> restarting the broker coordinating that group, we could get back to a working
> state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
