[jira] [Comment Edited] (KAFKA-7610) Detect consumer failures in initial JoinGroup

Boyang Chen (JIRA) Sat, 10 Nov 2018 09:19:10 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682509#comment-16682509
 ]


Boyang Chen edited comment on KAFKA-7610 at 11/10/18 5:18 PM:
--------------------------------------------------------------

Thanks Jason for proposing this issue. I think in static membership 
([KIP-345|https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances])
 we should be able to address this issue by remembering the newly joined 
member's name. Even after multiple disconnects, we will still have the exact 
same member.
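
As a rough illustration of the idea (a toy model, not the coordinator code; 
the class, the `member_name` parameter and the id format are all hypothetical), 
remembering the name means repeated joins from the same client map to one entry:

```python
import itertools


class StaticToyCoordinator:
    """Toy model only; hypothetical names, not Kafka's GroupCoordinator."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.members = {}      # member_id -> member_name (None for dynamic members)
        self.static_ids = {}   # member_name -> member_id

    def join(self, member_name=None):
        # A known static member gets back the id already remembered for its name.
        if member_name is not None and member_name in self.static_ids:
            return self.static_ids[member_name]
        # Otherwise a fresh member entry is created, as in the initial JoinGroup.
        member_id = f"member-{next(self._ids)}"
        self.members[member_id] = member_name
        if member_name is not None:
            self.static_ids[member_name] = member_id
        return member_id


# Dynamic client: three joins with an empty member id, each cut off by a disconnect.
dynamic = StaticToyCoordinator()
for _ in range(3):
    dynamic.join()
dynamic_entries = len(dynamic.members)

# Static client: the same three attempts, but carrying a stable name.
static = StaticToyCoordinator()
for _ in range(3):
    static.join(member_name="consumer-a")
static_entries = len(static.members)
```

In this sketch the dynamic client leaves one cache entry per attempt, while the 
named client only ever occupies a single entry.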

No matter which approach we eventually take, it requires authenticating the 
client's identity. So far I'm in favor of approach 2 for dynamic membership. 
The key point I'm trying to understand is what happens if the response message 
is lost on the way: I suppose that would lead to another "unknown member" join, 
creating a new consumer entry? Or do we believe one or two failed responses 
shouldn't matter, because our goal here is only to keep the cache from growing 
without bound?
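
To make the question concrete, here is a toy sketch of approach 2 with one lost 
response. It is only a model under my own assumptions: the class, method names 
and timings are hypothetical, and only the REBALANCE_IN_PROGRESS idea comes 
from the issue description.

```python
class EarlyReturnCoordinator:
    """Toy model of approach 2; hypothetical names, not Kafka's coordinator."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.last_seen = {}   # member_id -> time of the member's last request (ms)
        self._next = 0

    def join(self, now_ms, member_id=""):
        if member_id == "":
            # Initial JoinGroup: generate an id and return early instead of waiting.
            self._next += 1
            member_id = f"member-{self._next}"
        elif member_id not in self.last_seen:
            return "", "UNKNOWN_MEMBER_ID"
        self.last_seen[member_id] = now_ms
        return member_id, "REBALANCE_IN_PROGRESS"

    def expire(self, now_ms):
        # Session timeout: drop every member that has been silent for too long.
        dead = [m for m, t in self.last_seen.items()
                if now_ms - t >= self.session_timeout_ms]
        for m in dead:
            del self.last_seen[m]
        return dead


coordinator = EarlyReturnCoordinator(session_timeout_ms=10_000)

# First attempt: the entry is created, but the response never reaches the client.
coordinator.join(now_ms=0)

# The client never learned its id, so it retries with an empty member id.
member_id, error = coordinator.join(now_ms=100)

# From here on the client polls for the rebalance with its assigned id.
coordinator.join(now_ms=5_000, member_id=member_id)
coordinator.join(now_ms=9_000, member_id=member_id)

expired = coordinator.expire(now_ms=10_000)
```

In this model each lost response leaves exactly one dangling entry, and that 
entry is removed by the session timeout rather than lingering until the 
rebalance completes.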

Thanks!


was (Author: bchen225242):
Thanks Jason for proposing this issue. I think in static membership 
([KIP-345|https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances])
 we should be able to address this issue by remembering the newly joined 
member's name, even after multiple disconnects we will still have the exact 
same member.

This by far I feel is the easiest solution since no matter what approach we 
eventually take, it requires authentication of client identity. I'm in favor of 
the approach 2 so far on dynamic membership, and the key point I'm trying to 
understand is what if the response message failed on the way, which will lead 
to another "unknown member" join for a new consumer as I suppose?

> Detect consumer failures in initial JoinGroup
> ---------------------------------------------
>
>                 Key: KAFKA-7610
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7610
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jason Gustafson
>            Priority: Major
>
> The session timeout and heartbeating logic in the consumer allow us to detect 
> failures after a consumer joins the group. However, we have no mechanism to 
> detect failures during a consumer's initial JoinGroup when its memberId is 
> empty. When a client fails (e.g. due to a disconnect), the newly created 
> MemberMetadata will be left in the group metadata cache. Typically when this 
> happens, the client simply retries the JoinGroup. Every retry results in a 
> new dangling member created and left in the group. These members are doomed 
> to a session timeout when the group finally finishes the rebalance, but 
> before that time, they are occupying memory. In extreme cases, when a 
> rebalance is delayed (possibly due to a buggy application), this cycle can 
> repeat and the cache can grow quite large.
> There are a couple of options that come to mind to fix the problem:
> 1. During the initial JoinGroup, we can detect failed members when the TCP 
> connection fails. This is difficult at the moment because we do not have a 
> mechanism to propagate disconnects from the network layer. A potential option 
> is to treat the disconnect as just another type of request and pass it to the 
> handlers through the request queue.
> 2. Rather than holding the JoinGroup in purgatory for an indefinite amount of 
> time, we can return earlier with the generated memberId and an error code 
> (say REBALANCE_IN_PROGRESS) to indicate that retry is needed to complete the 
> rebalance. The consumer can then poll for the rebalance using its assigned 
> memberId. And we can detect failures through the session timeout. Obviously 
> this option requires a KIP (and some more thought).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
