[jira] [Created] (KAFKA-18911) alterPartition gets stuck when getting out-of-date errors

Luke Chen (Jira) Mon, 03 Mar 2025 01:56:04 -0800

Luke Chen created KAFKA-18911:
---------------------------------

             Summary: alterPartition gets stuck when getting out-of-date errors
                 Key: KAFKA-18911
                 URL: https://issues.apache.org/jira/browse/KAFKA-18911
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 3.9.0
            Reporter: Luke Chen
            Assignee: Luke Chen

When the leader node sends an AlterPartition request to the controller, the
controller performs [some
validation|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java#L1231]
before processing it. On the leader side, when an error is returned, we decide
whether the request should be retried
[here|https://github.com/apache/kafka/blob/898dcd11ad260e9b3cfefc5291c40e68009acb7d/core/src/main/scala/kafka/cluster/Partition.scala#L1868].
However, in some non-retry cases, we return false directly without changing
the partition state:

{code:java}
case Errors.UNKNOWN_TOPIC_OR_PARTITION =>
  info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
    "this topic or partition. Partition state may be out of sync, awaiting new the latest metadata.")
  false
case Errors.UNKNOWN_TOPIC_ID =>
  info(s"Failed to alter partition to $proposedIsrState since the controller doesn't know about " +
    "this topic. Partition state may be out of sync, awaiting new the latest metadata.")
  false
case Errors.FENCED_LEADER_EPOCH =>
  info(s"Failed to alter partition to $proposedIsrState since the leader epoch is old. " +
    "Partition state may be out of sync, awaiting new the latest metadata.")
  false
case Errors.INVALID_UPDATE_VERSION =>
  info(s"Failed to alter partition to $proposedIsrState because the partition epoch is invalid. " +
    "Partition state may be out of sync, awaiting new the latest metadata.")
  false
case Errors.INVALID_REQUEST =>
  info(s"Failed to alter partition to $proposedIsrState because the request is invalid. " +
    "Partition state may be out of sync, awaiting new the latest metadata.")
  false
case Errors.NEW_LEADER_ELECTED =>
  // The operation completed successfully but this replica got removed from the replica set by the controller
  // while completing a ongoing reassignment. This replica is no longer the leader but it does not know it
  // yet. It should remain in the current pending state until the metadata overrides it.
  // This is only raised in KRaft mode.
  info(s"The alter partition request successfully updated the partition state to $proposedIsrState but " +
    "this replica got removed from the replica set while completing a reassignment. " +
    "Waiting on new metadata to clean up this replica.")
  false
{code}
As the log message says, "Partition state may be out of sync, awaiting new the
latest metadata". But because the partition state is not updated, the partition
stays in the `PendingExpandIsr` or `PendingShrinkIsr` state, which keeps
`isInflight` set to true. While in this state, the partition state is never
updated again.
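
To make the failure mode concrete, here is a minimal, self-contained sketch of
the behavior described above. It is not Kafka's code: `ToyPartition`,
`CommittedState`, `AlterPartitionError`, and `requestsSent` are hypothetical
names invented for illustration, and the sketch assumes that a new ISR change
is only proposed when no request is in flight.

```scala
// Simplified, hypothetical model of the leader-side state; not Kafka's actual classes.
sealed trait PartitionState {
  def isr: Set[Int]
  def isInflight: Boolean
}

final case class CommittedState(isr: Set[Int]) extends PartitionState {
  val isInflight: Boolean = false
}

final case class PendingExpandIsr(isr: Set[Int], newInSyncReplicaId: Int) extends PartitionState {
  val isInflight: Boolean = true
}

// Stand-ins for two of the non-retry error codes quoted in the issue.
sealed trait AlterPartitionError
case object FencedLeaderEpoch extends AlterPartitionError
case object InvalidUpdateVersion extends AlterPartitionError

final class ToyPartition(initialIsr: Set[Int]) {
  private var state: PartitionState = CommittedState(initialIsr)

  // Counts how many (imaginary) AlterPartition requests were sent.
  var requestsSent: Int = 0

  def currentState: PartitionState = state

  // Assumption: an ISR expansion is only proposed when nothing is in flight.
  def maybeExpandIsr(replicaId: Int): Boolean = {
    if (state.isInflight) {
      false
    } else {
      state = PendingExpandIsr(state.isr, replicaId)
      requestsSent += 1
      true
    }
  }

  // Mirrors the non-retry branches: return false ("do not retry")
  // and leave the pending state untouched.
  def handleAlterPartitionError(error: AlterPartitionError): Boolean = error match {
    case FencedLeaderEpoch | InvalidUpdateVersion => false
  }
}

object StuckAlterPartitionDemo {
  def main(args: Array[String]): Unit = {
    val partition = new ToyPartition(Set(1, 2))

    println(partition.maybeExpandIsr(3))                                   // true: request sent
    println(partition.handleAlterPartitionError(InvalidUpdateVersion))     // false: no retry
    println(partition.currentState.isInflight)                             // true: still pending
    println(partition.maybeExpandIsr(3))                                   // false: skipped
    println(partition.requestsSent)                                        // 1
  }
}
```

In this toy model, nothing ever moves the partition out of the pending state
after the error, so every later attempt to change the ISR is skipped.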

The impact of this issue is that the ISR remains in a stale (wrong) state
until the leadership changes.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
