[ https://issues.apache.org/jira/browse/KAFKA-16954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lianet Magrans updated KAFKA-16954:
-----------------------------------
    Fix Version/s: 3.8.0

> Move consumer leave operations on close to background thread
> ------------------------------------------------------------
>
>                 Key: KAFKA-16954
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16954
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>            Reporter: Lianet Magrans
>            Assignee: Lianet Magrans
>            Priority: Blocker
>             Fix For: 3.8.0
>
>
> When a consumer unsubscribes, the app thread simply triggers an Unsubscribe
> event that takes care of everything in the background thread: release
> assignment (callbacks), clear assigned partitions, and send leave group HB.
> In contrast, when a consumer is closed, these actions are split across both
> threads:
> * release assignment -> in the app thread, by directly running the callbacks
> * clear assignment -> in the app thread, by updating the subscriptionState
> * send leave group HB -> in the background thread, via a LeaveOnClose event
> This situation could lead to race conditions, mainly because close updates
> the subscription state in the app thread while other operations that depend
> on it may already be running in the background. Example:
> * unsubscribe in app thread (triggers background UnsubscribeEvent to revoke
> and leave)
> * unsubscribe fails (e.g. interrupted, leaving the operation running in the
> background thread to revoke partitions and leave)
> * consumer close (will revoke and clear assignment in the app thread)
> * UnsubscribeEvent in the background may fail by trying to revoke
> partitions that it does not own anymore - _No current assignment for
> partition ..._
> A basic check has been added to the background thread revocation to avoid the
> race condition, ensuring that we only revoke partitions we own, but we should
> still address the root cause, which is updating the assignment on the app
> thread.
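The basic check mentioned above (revoke only partitions that are still owned) could look roughly like the sketch below. This is an illustration only, not the actual Kafka client code: the class name RevocationGuardSketch, the method ownedOnly, and the use of plain strings such as "t-0" in place of topic partitions are all assumptions made for the example.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the Kafka client.
final class RevocationGuardSketch {

    // Returns the requested partitions that are still in the current assignment.
    // Anything already cleared elsewhere (for example by a concurrent close) is
    // dropped, so the caller never tries to revoke a partition it no longer owns.
    static Set<String> ownedOnly(Collection<String> requested, Set<String> currentAssignment) {
        Set<String> owned = new LinkedHashSet<>(requested);
        owned.retainAll(currentAssignment);
        return owned;
    }
}
```

A guard like this only hides the symptom. The assignment can still change between the check and the revocation if another thread is allowed to update it, which is why the issue goes on to address the root cause.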
> We should consider having the close operation as a single
> LeaveOnClose event handled in the background. That event already takes care
> of revoking the partitions and clearing the assignment in the background, so
> there is no need to do it in the app thread. We should only ensure that we
> processBackgroundEvents until the LeaveOnClose completes (to allow for
> callbacks to run in the app thread).
>
> Trying to understand the current approach, I imagine the initial motivation
> for running the callbacks (and clearing the assignment) in the app thread was
> to avoid the back-and-forth: app thread close -> background thread leave event
> -> app thread to run callback -> background thread to clear assignment and
> send HB. But updating the assignment on the app thread ends up being
> problematic: the assignment is mainly updated in the background, so this
> opens the door to race conditions on the subscription state.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
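The proposal in the issue (close as a single LeaveOnClose event, with the app thread processing background events until it completes) can be sketched with plain java.util.concurrent types. Everything here is a hypothetical stand-in rather than the Kafka client implementation: the class LeaveOnCloseSketch, the single-thread executor playing the background thread, the Runnable queue playing the background events, and the string partitions are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical model of "close as one background event".
final class LeaveOnCloseSketch {
    // Stand-in for the background thread.
    private final ExecutorService background = Executors.newSingleThreadExecutor();
    // Work the background thread hands to the app thread (callbacks).
    private final BlockingQueue<Runnable> backgroundEvents = new LinkedBlockingQueue<>();
    // Read and modified only on the background executor.
    private final List<String> assignment = new ArrayList<>(List.of("t-0", "t-1"));
    // Records the order of the steps, for inspection.
    final List<String> steps = new CopyOnWriteArrayList<>();

    // The single leave event: request the callback, then clear and leave.
    private CompletableFuture<Void> submitLeaveOnClose() {
        CompletableFuture<Void> callbackDone = new CompletableFuture<>();
        CompletableFuture<Void> requested = CompletableFuture.runAsync(() -> {
            List<String> revoked = List.copyOf(assignment);
            // Ask the app thread to run the revocation callback.
            backgroundEvents.add(() -> {
                steps.add("callback:" + revoked);
                callbackDone.complete(null);
            });
        }, background);
        // Once the callback has run, continue on the background executor.
        return requested
            .thenCompose(ignored -> callbackDone)
            .thenRunAsync(() -> {
                assignment.clear();
                steps.add("assignment cleared");
                steps.add("leave heartbeat sent");
            }, background);
    }

    // App-thread close: never touches the assignment, only runs queued
    // callbacks until the leave event completes or the timeout expires.
    boolean close(long timeoutMs) {
        CompletableFuture<Void> leave = submitLeaveOnClose();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        try {
            while (!leave.isDone() && System.nanoTime() < deadline) {
                Runnable event = backgroundEvents.poll(10, TimeUnit.MILLISECONDS);
                if (event != null) {
                    event.run();
                }
            }
            return leave.isDone() && !leave.isCompletedExceptionally();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            background.shutdown();
        }
    }
}
```

In this sketch the app thread still runs the callback, but the assignment is read and cleared only on the background executor, so there is a single writer for that state. The cost is the back-and-forth between threads that the issue describes.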