[ 
https://issues.apache.org/jira/browse/KAFKA-20041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu resolved KAFKA-20041.
--------------------------------
    Resolution: Invalid

The current code already prevents the race described in the ticket; the diagnosis was wrong.

> Stuck ISR expansion due to partition reassignment completion race
> -----------------------------------------------------------------
>
>                 Key: KAFKA-20041
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20041
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Calvin Liu
>            Assignee: Calvin Liu
>            Priority: Major
>
> An ISR expansion from [0,1] -> [0,1,2] is stuck on the leader side. The 
> expansion can't complete because the replica set has changed from 
> [0,1,2] -> [0,1,3]. The AlterPartition request for the expansion fails with 
> INVALID_REQUEST, but its PendingExpandIsr state remains, which blocks future 
> ISR expansions.
>  
> The main reason is a rare race between the ISR expansion and partition 
> reassignment.
> {code:java}
> private def maybeExpandIsr(followerReplica: Replica): Unit = {
>   val needsIsrUpdate = !partitionState.isInflight && canAddReplicaToIsr(followerReplica.brokerId) && inReadLock(leaderIsrUpdateLock) {
>     needsExpandIsr(followerReplica)
>   }
>   if (needsIsrUpdate) {
>     val alterIsrUpdateOpt = inWriteLock(leaderIsrUpdateLock) {
>       // check if this replica needs to be added to the ISR
>       partitionState match {
>         case currentState: CommittedPartitionState if needsExpandIsr(followerReplica) =>
>           Some(prepareIsrExpand(currentState, followerReplica.brokerId))
>         case _ =>
>           None
>       }
>     }
> {code}
> The partition is expanding its ISR and enters `maybeExpandIsr`. Before 
> this thread acquires the `leaderIsrUpdateLock`, the partition reassignment 
> completes and the partition finishes its update (it now has the latest 
> partition epochs). The thread then acquires the lock and prepares the ISR 
> expansion. Because the code trusts the caller, it does not verify whether the 
> ISR candidate replica is still in the partition's replica set. The 
> partition therefore creates an invalid ISR update (wrong replica) with valid 
> epochs. In the end, the partition receives an INVALID_REQUEST error but does 
> not clear the PendingExpandIsr, which prevents future ISR 
> updates.
> The partition is unblocked after a leader restart.
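
For illustration only, the interleaving described in the ticket can be modelled
with a small self-contained sketch. This is not Kafka's Partition code: every
name in it (PartitionSketch, Committed, PendingExpand, completeReassignment,
and the betweenChecks hook) is hypothetical. It assumes that the check repeated
under the write lock re-reads the current replica set, which is the kind of
re-validation that would close the window between the two checks.

```scala
import java.util.concurrent.locks.{Lock, ReentrantReadWriteLock}

// Hypothetical, simplified stand-ins for the partition state; not Kafka classes.
sealed trait IsrState { def isr: Set[Int] }
final case class Committed(isr: Set[Int]) extends IsrState
final case class PendingExpand(isr: Set[Int], adding: Int) extends IsrState

final class PartitionSketch(initialReplicas: Set[Int], initialIsr: Set[Int]) {
  private val lock = new ReentrantReadWriteLock()
  private var replicas: Set[Int] = initialReplicas
  private var state: IsrState = Committed(initialIsr)

  private def locked[T](l: Lock)(body: => T): T = {
    l.lock()
    try body finally l.unlock()
  }

  // Reads the *current* replica set and ISR; the caller must hold a lock.
  private def needsExpand(brokerId: Int): Boolean =
    replicas.contains(brokerId) && !state.isr.contains(brokerId)

  // Models a completed reassignment: new replica set, ISR trimmed to match.
  def completeReassignment(newReplicas: Set[Int]): Unit =
    locked(lock.writeLock()) {
      replicas = newReplicas
      state = Committed(state.isr.intersect(newReplicas))
    }

  // betweenChecks is a test-only hook (an assumption of this sketch) that runs
  // in the gap between the optimistic check and the write-locked re-check.
  def maybeExpandIsr(brokerId: Int,
                     betweenChecks: () => Unit = () => ()): Option[PendingExpand] = {
    val looksNeeded = locked(lock.readLock())(needsExpand(brokerId))
    betweenChecks()
    if (!looksNeeded) None
    else locked(lock.writeLock()) {
      state match {
        // Re-validate under the write lock before committing to an expansion.
        case Committed(isr) if needsExpand(brokerId) =>
          val pending = PendingExpand(isr, brokerId)
          state = pending
          Some(pending)
        case _ => None
      }
    }
  }

  def currentState: IsrState = locked(lock.readLock())(state)
}
```

In this model, calling maybeExpandIsr(2, ...) on a partition with replicas
[0,1,2] and ISR [0,1], with the hook completing a reassignment to [0,1,3],
returns None and leaves the state committed, so no pending expansion is left
behind. Dropping the replicas.contains check from needsExpand would reproduce
the stuck state the ticket hypothesises.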



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
