[ 
https://issues.apache.org/jira/browse/KAFKA-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685961#comment-16685961
 ] 

Boyang Chen commented on KAFKA-7610:
------------------------------------

[~hachikuji] I see your point.

> a simple way to limit the memory from unknown group members is to not store 
> the subscription until the first JoinGroup arrives using the generated 
> memberId

The issue is that right now we fence any JoinGroup request whose member id is 
not in the member list as a real "unknown member id". So the question becomes: 
how do we know that this consumer has already visited and that we have already 
allocated a new member id for it? Any idea other than storing the allocated id 
somewhere else?
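
To make the question concrete, here is a minimal sketch of the bookkeeping it 
implies. It is written in Python for brevity and is not the coordinator's 
actual code. The class GroupSketch, the pending_ids set, and the generated id 
format are all hypothetical; REBALANCE_IN_PROGRESS is only the error code 
floated in the issue description.

```python
import uuid


class GroupSketch:
    """Hypothetical stand-in for the coordinator's per-group state."""

    def __init__(self):
        self.members = {}         # member_id -> subscription (full metadata)
        self.pending_ids = set()  # ids handed out but not yet used to join

    def join_group(self, member_id, subscription):
        """Return (member_id, error) for a JoinGroup request."""
        if member_id == "":
            # First contact: allocate an id, but do NOT store the subscription.
            new_id = f"consumer-{uuid.uuid4()}"
            self.pending_ids.add(new_id)
            return new_id, "REBALANCE_IN_PROGRESS"
        if member_id in self.pending_ids:
            # Second JoinGroup with the generated id: promote to a real member.
            self.pending_ids.discard(member_id)
            self.members[member_id] = subscription
            return member_id, "NONE"
        if member_id in self.members:
            # Known member rejoining: refresh its subscription.
            self.members[member_id] = subscription
            return member_id, "NONE"
        # Id was never allocated by us: fence it as today.
        return member_id, "UNKNOWN_MEMBER_ID"
```

In this sketch the allocated id still has to live somewhere (pending_ids), and 
that set would need its own expiry to stay bounded, which is exactly the extra 
state the question above is about.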

> If we want to protect the overall size of the group, perhaps a configuration 
> would be more effective? For example, `group.max.size` or something like that.

group.max.size is a good approach to limiting memory usage. However, I wonder 
whether it would inconvenience users who need to scale a group beyond 
group.max.size. What would be the expected behavior when we reach the member 
size limit? Do we just refuse any new member's join request?
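
One possible reading of "refuse any new member", sketched in Python under 
stated assumptions: members already in the group may still rejoin at the 
limit, and only unseen member ids are rejected. The class BoundedGroup and the 
error name GROUP_MAX_SIZE_REACHED are hypothetical placeholders, not an 
existing broker API.

```python
class BoundedGroup:
    """Hypothetical group that enforces a group.max.size style limit."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.members = {}  # member_id -> subscription

    def join(self, member_id, subscription):
        """Return an error string for a join attempt by member_id."""
        is_new = member_id not in self.members
        if is_new and len(self.members) >= self.max_size:
            # Group is full: reject the newcomer, keep existing members intact.
            return "GROUP_MAX_SIZE_REACHED"  # assumed error name
        self.members[member_id] = subscription
        return "NONE"
```

With max_size set to 2, a third distinct member would be rejected while either 
of the first two could still rejoin, so scaling past the limit would require 
raising the configuration first.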

> Detect consumer failures in initial JoinGroup
> ---------------------------------------------
>
>                 Key: KAFKA-7610
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7610
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jason Gustafson
>            Priority: Major
>
> The session timeout and heartbeating logic in the consumer allow us to detect 
> failures after a consumer joins the group. However, we have no mechanism to 
> detect failures during a consumer's initial JoinGroup when its memberId is 
> empty. When a client fails (e.g. due to a disconnect), the newly created 
> MemberMetadata will be left in the group metadata cache. Typically when this 
> happens, the client simply retries the JoinGroup. Every retry results in a 
> new dangling member created and left in the group. These members are doomed 
> to a session timeout when the group finally finishes the rebalance, but 
> before that time, they are occupying memory. In extreme cases, when a 
> rebalance is delayed (possibly due to a buggy application), this cycle can 
> repeat and the cache can grow quite large.
> There are a couple options that come to mind to fix the problem:
> 1. During the initial JoinGroup, we can detect failed members when the TCP 
> connection fails. This is difficult at the moment because we do not have a 
> mechanism to propagate disconnects from the network layer. A potential option 
> is to treat the disconnect as just another type of request and pass it to the 
> handlers through the request queue.
> 2. Rather than holding the JoinGroup in purgatory for an indefinite amount of 
> time, we can return earlier with the generated memberId and an error code 
> (say REBALANCE_IN_PROGRESS) to indicate that retry is needed to complete the 
> rebalance. The consumer can then poll for the rebalance using its assigned 
> memberId. And we can detect failures through the session timeout. Obviously 
> this option requires a KIP (and some more thought).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
