[
https://issues.apache.org/jira/browse/KAFKA-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luke Chen resolved KAFKA-16247.
-------------------------------
Resolution: Fixed
Fixed in 3.7.0 RC4
> replica keep out-of-sync after migrating broker to KRaft
> --------------------------------------------------------
>
> Key: KAFKA-16247
> URL: https://issues.apache.org/jira/browse/KAFKA-16247
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.7.0
> Reporter: Luke Chen
> Priority: Major
> Attachments: KAFKA-16247.zip
>
>
> We are deploying 3 controllers and 3 brokers, and following the steps in the
> [doc|https://kafka.apache.org/documentation/#kraft_zk_migration]. When
> moving from the "Enabling the migration on the brokers" state to the
> "Migrating brokers to KRaft" state, the first rolled broker becomes
> out-of-sync and never becomes in-sync again.
> From the log, we can see some "reject alterPartition" errors, but they
> happen only twice. Theoretically, the leader should add the follower to the
> ISR as long as the follower is fetching, since no client is writing data.
> But we can't figure out why it didn't fetch.
> Logs: https://gist.github.com/showuon/64c4dcecb238a317bdbdec8db17fd494
> ===
> update Feb. 14
> After further investigation of the logs, I think the replica is not added
> to the ISR because the alterPartition request received a non-retriable
> error from the controller:
> {code:java}
> Failed to alter partition to PendingExpandIsr(newInSyncReplicaId=0,
> sentLeaderAndIsr=LeaderAndIsr(leader=1, leaderEpoch=4,
> isrWithBrokerEpoch=List(BrokerState(brokerId=1, brokerEpoch=-1),
> BrokerState(brokerId=2, brokerEpoch=-1), BrokerState(brokerId=0,
> brokerEpoch=-1)), leaderRecoveryState=RECOVERED, partitionEpoch=7),
> leaderRecoveryState=RECOVERED,
> lastCommittedState=CommittedPartitionState(isr=Set(1, 2),
> leaderRecoveryState=RECOVERED)) because the partition epoch is invalid.
> Partition state may be out of sync, awaiting new the latest metadata.
> (kafka.cluster.Partition)
> [zk-broker-1-to-controller-alter-partition-channel-manager]
> {code}
> Since it's a non-retriable error, the leader keeps the state as pending and
> waits for a later leaderAndISR update, as described
> [here|https://github.com/apache/kafka/blob/d24abe0edebad37e554adea47408c3063037f744/core/src/main/scala/kafka/cluster/Partition.scala#L1876C1-L1876C41].
> Log analysis: https://gist.github.com/showuon/5514cbb995fc2ae6acd5858f69c137bb
> So the questions become:
> 1. Why does the controller increase the partition epoch?
> 2. When the leader receives the leaderAndISR request from the controller, it
> ignores the request because the leader epoch is identical, even though the
> partition epoch is updated. Is this behavior expected? Will it impact
> later alterPartition requests?
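
The interaction described in the two questions above can be sketched as a toy
model. This is not Kafka code: the class names, method names, and the
INVALID_PARTITION_EPOCH label are hypothetical, and the bump of the partition
epoch from 7 to 8 is an assumed value chosen for illustration. Only
leaderEpoch=4, partitionEpoch=7, and isr=Set(1, 2) come from the log above.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class ControllerPartition:
    """Hypothetical controller-side view of one partition."""
    leader_epoch: int
    partition_epoch: int
    isr: Set[int]

    def alter_partition(self, sent_partition_epoch: int, new_isr: Set[int]) -> str:
        # Reject the change when the sender's partition epoch is stale.
        if sent_partition_epoch != self.partition_epoch:
            return "INVALID_PARTITION_EPOCH"  # hypothetical error label
        self.isr = set(new_isr)
        self.partition_epoch += 1
        return "NONE"


@dataclass
class LeaderPartition:
    """Hypothetical leader-side view of the same partition."""
    leader_epoch: int
    partition_epoch: int
    isr: Set[int]
    pending_isr: Optional[Set[int]] = None

    def on_leader_and_isr(self, leader_epoch: int, partition_epoch: int,
                          isr: Set[int]) -> bool:
        # Behavior as reported: the request is ignored when the leader
        # epoch is not newer, even if the partition epoch has changed.
        if leader_epoch <= self.leader_epoch:
            return False
        self.leader_epoch = leader_epoch
        self.partition_epoch = partition_epoch
        self.isr = set(isr)
        self.pending_isr = None
        return True

    def try_expand_isr(self, controller: ControllerPartition, follower: int) -> str:
        self.pending_isr = self.isr | {follower}
        error = controller.alter_partition(self.partition_epoch, self.pending_isr)
        if error == "NONE":
            self.isr, self.pending_isr = self.pending_isr, None
            self.partition_epoch = controller.partition_epoch
        # On a non-retriable error the pending state is kept as-is.
        return error


def simulate():
    controller = ControllerPartition(leader_epoch=4, partition_epoch=7, isr={1, 2})
    leader = LeaderPartition(leader_epoch=4, partition_epoch=7, isr={1, 2})

    # Assumption: the controller bumps only the partition epoch.
    controller.partition_epoch += 1

    applied = leader.on_leader_and_isr(
        controller.leader_epoch, controller.partition_epoch, controller.isr)
    error = leader.try_expand_isr(controller, follower=0)
    return applied, error, leader.isr, leader.pending_isr
```

In this model, simulate() leaves the leader with a stale partition epoch, so
the ISR expansion for broker 0 is rejected and stays pending. That matches the
symptom in the report, but it does not explain why the controller increased
the partition epoch in the first place.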
--
This message was sent by Atlassian Jira
(v8.20.10#820010)