[
https://issues.apache.org/jira/browse/KAFKA-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guozhang Wang resolved KAFKA-7610.
----------------------------------
Resolution: Fixed
Assignee: Jason Gustafson (was: Boyang Chen)
Fix Version/s: 2.1.1
2.2.0
> Detect consumer failures in initial JoinGroup
> ---------------------------------------------
>
> Key: KAFKA-7610
> URL: https://issues.apache.org/jira/browse/KAFKA-7610
> Project: Kafka
> Issue Type: Improvement
> Components: consumer
> Reporter: Jason Gustafson
> Assignee: Jason Gustafson
> Priority: Major
> Fix For: 2.2.0, 2.1.1
>
>
> The session timeout and heartbeating logic in the consumer allow us to detect
> failures after a consumer joins the group. However, we have no mechanism to
> detect failures during a consumer's initial JoinGroup when its memberId is
> empty. When a client fails (e.g. due to a disconnect), the newly created
> MemberMetadata will be left in the group metadata cache. Typically when this
> happens, the client simply retries the JoinGroup. Every retry results in a
> new dangling member created and left in the group. These members are doomed
> to a session timeout when the group finally finishes the rebalance, but
> before that time, they are occupying memory. In extreme cases, when a
> rebalance is delayed (possibly due to a buggy application), this cycle can
> repeat and the cache can grow quite large.
> There are a couple options that come to mind to fix the problem:
> 1. During the initial JoinGroup, we can detect failed members when the TCP
> connection fails. This is difficult at the moment because we do not have a
> mechanism to propagate disconnects from the network layer. A potential option
> is to treat the disconnect as just another type of request and pass it to the
> handlers through the request queue.
> 2. Rather than holding the JoinGroup in purgatory for an indefinite amount of
> time, we can return earlier with the generated memberId and an error code
> (say REBALANCE_IN_PROGRESS) to indicate that retry is needed to complete the
> rebalance. The consumer can then poll for the rebalance using its assigned
> memberId. And we can detect failures through the session timeout. Obviously
> this option requires a KIP (and some more thought).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)