[jira] [Commented] (KAFKA-7610) Detect consumer failures in initial JoinGroup

Boyang Chen (JIRA) Fri, 16 Nov 2018 11:25:12 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689916#comment-16689916
 ]


Boyang Chen commented on KAFKA-7610:
------------------------------------

Sounds good, Jason! I think combining `group.max.size` with the member id 
requirement in the JoinGroup request should be sufficient for solving the 
above scenario. I will open a separate Jira for group size. In the meantime, 
I think the conclusion is that Jason's original proposal 2 should be 
sufficient to mitigate the problem.
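
To make the combination concrete, here is a rough sketch of how the two checks 
might interact. It is only an illustration under assumptions: the semantics of 
`group.max.size` are not settled in this thread, so the sketch assumes it caps 
the number of member entries the coordinator keeps. The function name 
`handle_join`, the `new_id` parameter, and the outcome strings are all made up 
for the example and are not Kafka APIs or error codes.

```python
def handle_join(members, member_id, group_max_size, new_id):
    """Toy JoinGroup handling; returns (member_id, outcome).

    `members` is a set of known member ids. Outcome names are hypothetical.
    """
    if member_id in members:
        # Known member retrying: no new entry is created.
        return member_id, "REJOINED"
    if len(members) >= group_max_size:
        # Assumed meaning of group.max.size: refuse to grow the cache further.
        return member_id, "GROUP_FULL"
    if not member_id:
        # Initial join: record the generated id and ask the client to retry with it.
        members.add(new_id)
        return new_id, "RETRY_WITH_ID"
    # Non-empty id that the coordinator never issued.
    return member_id, "UNKNOWN_MEMBER"
```

With a cap of 2, two initial joins are recorded, a retry that carries its id 
reuses the existing entry, and a third initial join is refused, so the cache 
stays bounded even if clients keep retrying.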

> Detect consumer failures in initial JoinGroup
> ---------------------------------------------
>
>                 Key: KAFKA-7610
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7610
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jason Gustafson
>            Priority: Major
>
> The session timeout and heartbeating logic in the consumer allow us to detect 
> failures after a consumer joins the group. However, we have no mechanism to 
> detect failures during a consumer's initial JoinGroup when its memberId is 
> empty. When a client fails (e.g. due to a disconnect), the newly created 
> MemberMetadata will be left in the group metadata cache. Typically when this 
> happens, the client simply retries the JoinGroup. Every retry results in a 
> new dangling member created and left in the group. These members are doomed 
> to a session timeout when the group finally finishes the rebalance, but 
> before that time, they are occupying memory. In extreme cases, when a 
> rebalance is delayed (possibly due to a buggy application), this cycle can 
> repeat and the cache can grow quite large.
> There are a couple of options that come to mind to fix the problem:
> 1. During the initial JoinGroup, we can detect failed members when the TCP 
> connection fails. This is difficult at the moment because we do not have a 
> mechanism to propagate disconnects from the network layer. A potential option 
> is to treat the disconnect as just another type of request and pass it to the 
> handlers through the request queue.
> 2. Rather than holding the JoinGroup in purgatory for an indefinite amount of 
> time, we can return earlier with the generated memberId and an error code 
> (say REBALANCE_IN_PROGRESS) to indicate that retry is needed to complete the 
> rebalance. The consumer can then poll for the rebalance using its assigned 
> memberId. And we can detect failures through the session timeout. Obviously 
> this option requires a KIP (and some more thought).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
