[ 
https://issues.apache.org/jira/browse/KAFKA-16954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianet Magrans updated KAFKA-16954:
-----------------------------------
    Priority: Blocker  (was: Major)

> Move consumer leave operations on close to background thread
> ------------------------------------------------------------
>
>                 Key: KAFKA-16954
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16954
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>            Reporter: Lianet Magrans
>            Priority: Blocker
>
> When a consumer unsubscribes, the app thread simply triggers an Unsubscribe 
> event that takes care of everything in the background thread: release 
> assignment (callbacks), clear assigned partitions, and send leave group HB.
> In contrast, when a consumer is closed, these actions are split across both 
> threads:
>  * release assignment -> in the app thread by directly running the callbacks
>  * clear assignment -> in app thread by updating the subscriptionState
>  * send leave group HB -> in the background thread via an event LeaveOnClose 
> This split can lead to race conditions, mainly because close updates the 
> subscription state in the app thread while background operations that depend 
> on it may already be running. For example:
>  * unsubscribe in app thread (triggers background UnsubscribeEvent to revoke 
> and leave)
>  * unsubscribe fails (ex. interrupted, leaving operation running in the 
> background thread to revoke partitions and leave)
>  * consumer close (will revoke and clear assignment in the app thread)
>  * UnsubscribeEvent in the background may fail by trying to revoke 
> partitions that it no longer owns - _No current assignment for 
> partition ..._
> A basic check has been added to the background thread revocation to avoid the 
> race condition, ensuring that we only revoke partitions we own, but we should 
> still address the root cause, which is updating the assignment on the app 
> thread. We should consider having the close operation as a single 
> LeaveOnClose event handled in the background. That event already takes care 
> of revoking the partitions and clearing the assignment in the background, so 
> there is no need to do it in the app thread. We should only ensure that we 
> processBackgroundEvents until the LeaveOnClose completes (to allow for 
> callbacks to run in the app thread).
>  
> Trying to understand the current approach, I imagine the initial motivation 
> for running the callbacks (and clearing the assignment) in the app thread was 
> to avoid the back-and-forth: app thread close -> background thread leave event 
> -> app thread to run callback -> background thread to clear assignment and 
> send HB. But updating the assignment on the app thread ends up being 
> problematic: the assignment is mainly updated in the background, so doing it 
> in the app thread too opens the door to race conditions on the subscription 
> state.
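
The close flow proposed in the description can be illustrated with a small, self-contained sketch. This is not the Kafka client code: the class CloseSketch, the CallbackRequest type, the two queues, and the log are all hypothetical stand-ins, and the "heartbeat" is only a log entry. It assumes a single background thread that owns the assignment, with the app thread limited to submitting one leave-on-close task and running callbacks until that task completes.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch: only the background thread mutates the assignment.
class CloseSketch {
    // Stand-in for a "run the rebalance callback" request sent to the app thread.
    static final class CallbackRequest {
        final Set<String> revoked;
        final CompletableFuture<Void> done = new CompletableFuture<>();
        CallbackRequest(Set<String> revoked) { this.revoked = revoked; }
    }

    private final BlockingQueue<Runnable> appEvents = new LinkedBlockingQueue<>();            // app -> background
    private final BlockingQueue<CallbackRequest> backgroundEvents = new LinkedBlockingQueue<>(); // background -> app
    private final Set<String> assignment = new TreeSet<>(); // touched only by the background thread once started
    final List<String> log = Collections.synchronizedList(new ArrayList<>());
    private final Thread background = new Thread(this::runBackground, "background");
    private volatile boolean running = true;

    CloseSketch(Set<String> initialAssignment) {
        assignment.addAll(initialAssignment); // safe: happens before the thread starts
        background.start();
    }

    private void runBackground() {
        try {
            while (running) {
                Runnable event = appEvents.poll(20, TimeUnit.MILLISECONDS);
                if (event != null) event.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs on the background thread: revoke only what is currently owned.
    private void handleLeaveOnClose(CompletableFuture<Void> leave) {
        CallbackRequest request = new CallbackRequest(new TreeSet<>(assignment));
        backgroundEvents.add(request);
        // When the app thread finishes the callback, hop back to the background
        // thread before touching the assignment.
        request.done.thenRun(() -> appEvents.add(() -> {
            assignment.clear();
            log.add("assignment cleared");
            log.add("leave heartbeat sent");
            leave.complete(null);
        }));
    }

    // Runs on the app thread: one event, then process callbacks until it completes.
    void close() {
        CompletableFuture<Void> leave = new CompletableFuture<>();
        appEvents.add(() -> handleLeaveOnClose(leave));
        try {
            while (!leave.isDone()) {
                CallbackRequest request = backgroundEvents.poll(10, TimeUnit.MILLISECONDS);
                if (request != null) {
                    log.add("callback: revoked " + request.revoked);
                    request.done.complete(null);
                }
            }
            running = false;
            background.join();
        } catch (InterruptedException e) {
            running = false;
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        CloseSketch consumer = new CloseSketch(Set.of("t-0", "t-1"));
        consumer.close();
        System.out.println(consumer.log);
    }
}
```

In this sketch the app thread never writes to the assignment. It pays the back-and-forth cost mentioned above (close -> leave event -> callback -> clear and HB) in exchange for a single writer.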



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
