[
https://issues.apache.org/jira/browse/KAFKA-20041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Calvin Liu resolved KAFKA-20041.
--------------------------------
Resolution: Invalid
The current code already prevents the race described in this ticket; the diagnosis was wrong.
> Stuck ISR expansion due to partition reassignment completion race
> -----------------------------------------------------------------
>
> Key: KAFKA-20041
> URL: https://issues.apache.org/jira/browse/KAFKA-20041
> Project: Kafka
> Issue Type: Bug
> Reporter: Calvin Liu
> Assignee: Calvin Liu
> Priority: Major
>
> An ISR expansion from [0,1] -> [0,1,2] is stuck on the leader side. The
> expansion can't complete because the replica set has changed from
> [0,1,2] -> [0,1,3]. Its AlterPartition request fails with INVALID_REQUEST,
> but the PendingExpandIsr state remains, which blocks future ISR expansions.
>
> The main reason is a rare race between the ISR expansion and partition
> reassignment.
> {code:java}
> private def maybeExpandIsr(followerReplica: Replica): Unit = {
>   val needsIsrUpdate = !partitionState.isInflight &&
>     canAddReplicaToIsr(followerReplica.brokerId) &&
>     inReadLock(leaderIsrUpdateLock) {
>       needsExpandIsr(followerReplica)
>     }
>   if (needsIsrUpdate) {
>     val alterIsrUpdateOpt = inWriteLock(leaderIsrUpdateLock) {
>       // check if this replica needs to be added to the ISR
>       partitionState match {
>         case currentState: CommittedPartitionState if needsExpandIsr(followerReplica) =>
>           Some(prepareIsrExpand(currentState, followerReplica.brokerId))
>         case _ =>
>           None
>       }
>     }
> {code}
> The partition is expanding its ISR and enters `maybeExpandIsr`. Before
> this thread acquires the `leaderIsrUpdateLock`, the partition reassignment
> completes and the partition finishes applying the update (it now has the
> latest partition epochs). The thread then acquires the lock and prepares the
> ISR expansion. Because the code trusts the caller, it does not verify whether
> the ISR candidate replica is still in the partition's replica set. The
> partition therefore creates an invalid ISR update (wrong replica) with valid
> epochs. In the end, the partition receives an INVALID_REQUEST error but does
> not clear the PendingExpandIsr, which prevents future ISR updates.
> The partition is unblocked after a leader restart.
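
The check-then-act window described above can be illustrated with a small, self-contained model. This is a minimal sketch and not Kafka's actual code: `PartitionModel`, `Committed`, `PendingExpand`, `completeReassignment` and `tryExpandIsr` are hypothetical names invented for illustration. It shows the general technique of re-validating, under the write lock, that the candidate is still in the replica set before moving to a pending state. Per the resolution note, the current Kafka code already prevents the race, so this should be read only as an illustration of the pattern.

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

// Hypothetical, simplified model; none of these types are Kafka's real classes.
sealed trait IsrState
final case class Committed(isr: Set[Int]) extends IsrState
final case class PendingExpand(isr: Set[Int], adding: Int) extends IsrState

final class PartitionModel(initialReplicas: Set[Int], initialIsr: Set[Int]) {
  private val lock = new ReentrantReadWriteLock()
  private var replicas: Set[Int] = initialReplicas
  private var state: IsrState = Committed(initialIsr)

  // Reads the current ISR state under the read lock.
  def currentState: IsrState = {
    lock.readLock().lock()
    try { state } finally { lock.readLock().unlock() }
  }

  // Models a completed reassignment that replaces the replica set.
  def completeReassignment(newReplicas: Set[Int]): Unit = {
    lock.writeLock().lock()
    try { replicas = newReplicas } finally { lock.writeLock().unlock() }
  }

  // Re-checks every precondition under the write lock, including (by
  // assumption in this sketch) replica-set membership, so a reassignment
  // that completed just before the lock was taken cannot leave a stale
  // pending expansion behind.
  def tryExpandIsr(brokerId: Int): Boolean = {
    lock.writeLock().lock()
    try {
      state match {
        case Committed(isr) if replicas.contains(brokerId) && !isr.contains(brokerId) =>
          state = PendingExpand(isr, brokerId)
          true
        case _ =>
          false
      }
    } finally {
      lock.writeLock().unlock()
    }
  }
}

object RaceDemo {
  def main(args: Array[String]): Unit = {
    val p = new PartitionModel(Set(0, 1, 2), Set(0, 1))
    // The reassignment [0,1,2] -> [0,1,3] completes before the expansion runs.
    p.completeReassignment(Set(0, 1, 3))
    println(p.tryExpandIsr(2)) // candidate 2 is no longer a replica
    println(p.tryExpandIsr(3)) // candidate 3 is a replica and not yet in the ISR
    println(p.currentState)
  }
}
```

In this model, the rejected attempt for broker 2 leaves the state as `Committed`, so a later expansion for broker 3 is not blocked.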
--
This message was sent by Atlassian Jira
(v8.20.10#820010)