[
https://issues.apache.org/jira/browse/KAFKA-18911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luke Chen resolved KAFKA-18911.
-------------------------------
Resolution: Invalid
> alterPartition gets stuck when getting out-of-date errors
> ---------------------------------------------------------
>
> Key: KAFKA-18911
> URL: https://issues.apache.org/jira/browse/KAFKA-18911
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.9.0
> Reporter: Luke Chen
> Assignee: Luke Chen
> Priority: Major
>
> When the leader node sends an AlterPartition request to the controller, the
> controller performs [some
> validation|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1231]
> before processing it. On the leader side, when an error comes back, we
> decide whether the request should be retried
> [here|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/core/src/main/scala/kafka/cluster/Partition.scala#L1868].
> However, in some non-retriable cases, we return false directly without
> changing the state:
>
> {code:java}
> case Errors.UNKNOWN_TOPIC_OR_PARTITION =>
>   info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
>     "this topic or partition. Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.UNKNOWN_TOPIC_ID =>
>   info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
>     "this topic. Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.FENCED_LEADER_EPOCH =>
>   info(s"Failed to alter partition to $proposedIsrState since the leader epoch is old. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.INVALID_UPDATE_VERSION =>
>   info(s"Failed to alter partition to $proposedIsrState because the partition epoch is invalid. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.INVALID_REQUEST =>
>   info(s"Failed to alter partition to $proposedIsrState because the request is invalid. " +
>     "Partition state may be out of sync, awaiting new the latest metadata.")
>   false
> case Errors.NEW_LEADER_ELECTED =>
>   // The operation completed successfully but this replica got removed from the replica set by the controller
>   // while completing a ongoing reassignment. This replica is no longer the leader but it does not know it
>   // yet. It should remain in the current pending state until the metadata overrides it.
>   // This is only raised in KRaft mode.
>   info(s"The alter partition request successfully updated the partition state to $proposedIsrState but " +
>     "this replica got removed from the replica set while completing a reassignment. " +
>     "Waiting on new metadata to clean up this replica.")
>   false
> {code}
> As the log says, "Partition state may be out of sync, awaiting new the
> latest metadata". But because the partition state is not updated, it
> stays in the `PendingExpandIsr` or `PendingShrinkIsr` state, which keeps
> `isInflight` set to true. In this state, the partition state will never be
> updated again.
>
> The impact of this issue is that the ISR state stays stale (wrong) until a
> leadership change.
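
The stuck state described in the report can be illustrated with a toy model. This is only a sketch of the behaviour as the report describes it, not Kafka's actual `Partition` class: `IsrStateSketch`, `PartitionModel`, and every method name are hypothetical, and the assumption that only a leadership change clears the pending state is taken from the report's own claim. The issue was resolved as Invalid, so real brokers may well behave differently.

```java
// Toy model of the scenario in the report; all names here are hypothetical.
class IsrStateSketch {
    // Stand-ins for the committed and pending ISR states named in the report.
    enum State { COMMITTED, PENDING_EXPAND_ISR, PENDING_SHRINK_ISR }

    static final class PartitionModel {
        private State state = State.COMMITTED;

        // Assumption: any pending state counts as an in-flight AlterPartition.
        boolean isInflight() {
            return state != State.COMMITTED;
        }

        // Try to start an ISR expansion; refused while a change is in flight.
        boolean proposeExpand() {
            if (isInflight()) {
                return false;
            }
            state = State.PENDING_EXPAND_ISR;
            return true;
        }

        // Mirrors the quoted branches: report "do not retry" and leave the
        // pending state untouched.
        boolean handleNonRetriableError() {
            return false;
        }

        // Assumption taken from the report: a leadership change resets the state.
        void onLeadershipChange() {
            state = State.COMMITTED;
        }
    }

    public static void main(String[] args) {
        PartitionModel p = new PartitionModel();
        System.out.println("first proposal accepted: " + p.proposeExpand());
        System.out.println("retry after error: " + p.handleNonRetriableError());
        System.out.println("still in flight: " + p.isInflight());
        System.out.println("second proposal accepted: " + p.proposeExpand());
        p.onLeadershipChange();
        System.out.println("proposal after leadership change accepted: " + p.proposeExpand());
    }
}
```

In this model the second proposal is refused because the error handler neither retries nor clears the pending state, which is the situation the report calls stuck.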