[jira] [Created] (KAFKA-7610) Detect consumer failures in initial JoinGroup

Jason Gustafson (JIRA) Fri, 09 Nov 2018 23:40:10 -0800

Jason Gustafson created KAFKA-7610:
--------------------------------------

             Summary: Detect consumer failures in initial JoinGroup
                 Key: KAFKA-7610
                 URL: https://issues.apache.org/jira/browse/KAFKA-7610
             Project: Kafka
          Issue Type: Improvement
            Reporter: Jason Gustafson



The session timeout and heartbeating logic in the consumer allow us to detect 
failures after a consumer joins the group. However, we have no mechanism to 
detect failures during a consumer's initial JoinGroup when its memberId is 
empty. When a client fails (e.g. due to a disconnect), the newly created 
MemberMetadata will be left in the group metadata cache. Typically when this 
happens, the client simply retries the JoinGroup. Every retry results in a new 
dangling member created and left in the group. These members are doomed to a 
session timeout when the group finally finishes the rebalance, but before that 
time, they are occupying memory. In extreme cases, when a rebalance is delayed 
(possibly due to a buggy application), this cycle can repeat and the cache can 
grow quite large.

There are a couple options that come to mind to fix the problem:

1. During the initial JoinGroup, we can detect failed members when the TCP 
connection fails. This is difficult at the moment because we do not have a 
mechanism to propagate disconnects from the network layer. A potential option 
is to treat the disconnect as just another type of request and pass it to the 
handlers through the request queue.

2. Rather than holding the JoinGroup in purgatory for an indefinite amount of 
time, we can return earlier with the generated memberId and an error code (say 
REBALANCE_IN_PROGRESS) to indicate that retry is needed to complete the 
rebalance. The consumer can then poll for the rebalance using its assigned 
memberId. And we can detect failures through the session timeout. Obviously 
this option requires a KIP (and some more thought).








--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Created] (KAFKA-7610) Detect consumer failures in initial JoinGroup

Reply via email to