[ https://issues.apache.org/jira/browse/KAFKA-15353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755614#comment-17755614 ]
Calvin Liu commented on KAFKA-15353:
------------------------------------

Hey [~showuon], thanks for pointing it out! Will work on the fix.

> Empty ISR returned from controller after AlterPartition request
> ---------------------------------------------------------------
>
>                 Key: KAFKA-15353
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15353
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.5.0
>            Reporter: Luke Chen
>            Priority: Blocker
>             Fix For: 3.6.0, 3.5.2
>
>
> In
> [KIP-903|https://cwiki.apache.org/confluence/display/KAFKA/KIP-903%3A+Replicas+with+stale+broker+epoch+should+not+be+allowed+to+join+the+ISR]
> (more specifically, this [PR|https://github.com/apache/kafka/pull/13408]), we
> bumped the AlterPartitionRequest version to 3 to use the `NewIsrWithEpochs`
> field instead of the `NewIsr` one. When building the request for an older
> version, we manually convert/downgrade the request for backward compatibility
> [here|https://github.com/apache/kafka/blob/6bd17419b76f8cf8d7e4a11c071494dfaa72cd50/clients/src/main/java/org/apache/kafka/common/requests/AlterPartitionRequest.java#L85-L96]:
> we extract the ISR info from `NewIsrWithEpochs`, fill in the `NewIsr` field,
> and then clear the `NewIsrWithEpochs` field.
>
> The problem is that when the AlterPartitionRequest is sent out for the first
> time and there is a transient error (e.g. NOT_CONTROLLER), we retry. On the
> retry, we build the AlterPartitionRequest again, but this time the request
> data is the one that was already converted above. At this point, when we try
> to extract the ISR from `NewIsrWithEpochs`, we get an empty list. So we send
> out an AlterPartition request with an empty ISR, which impacts Kafka
> availability.
>
> From the log, I can see this:
> {code:java}
> [2023-08-16 03:57:55,122] INFO [Partition test_topic-1 broker=3] ISR updated to (under-min-isr) and version updated to 9 (kafka.cluster.Partition)
> ...
> [2023-08-16 03:57:55,157] ERROR [ReplicaManager broker=3] Error processing append operation on partition test_topic-1 (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.NotEnoughReplicasException: The size of the current ISR Set() is insufficient to satisfy the min.isr requirement of 2 for partition test_topic-1
> {code}
>
> h4. *Impact:*
> This happens when users upgrade from versions < 3.5.0 to 3.5.0 or later.
> During the rolling upgrade, some nodes will be on v3.5.0 and some will not,
> so a node on v3.5.0 will try to build an old version of the
> AlterPartitionRequest. If it then hits a transient error while sending the
> AlterPartitionRequest, the ISR will be empty and no producers will be able to
> write data to the partitions.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
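
For anyone following the retry path described in the issue, here is a minimal sketch of how an in-place downgrade can lose the ISR on the second build. It is an illustration only: `RequestData`, `buildDestructive`, and `buildOnCopy` are hypothetical names, the epochs are omitted for brevity, and none of this is the actual Kafka code or the actual fix.

```java
import java.util.ArrayList;
import java.util.List;

class RetryDowngradeSketch {
    /** Hypothetical stand-in for the request data; not the real Kafka class. */
    static final class RequestData {
        List<Integer> newIsrWithEpochs = new ArrayList<>(); // field used from version 3 on
        List<Integer> newIsr = new ArrayList<>();           // field used before version 3
    }

    /** Downgrades in place, as the issue describes: the source field is cleared. */
    static RequestData buildDestructive(RequestData data, short version) {
        if (version < 3) {
            data.newIsr = new ArrayList<>(data.newIsrWithEpochs);
            data.newIsrWithEpochs = new ArrayList<>();
        }
        return data;
    }

    /** One possible alternative (an assumption): downgrade into a fresh copy. */
    static RequestData buildOnCopy(RequestData data, short version) {
        RequestData out = new RequestData();
        if (version < 3) {
            out.newIsr = new ArrayList<>(data.newIsrWithEpochs);
        } else {
            out.newIsrWithEpochs = new ArrayList<>(data.newIsrWithEpochs);
        }
        return out;
    }

    public static void main(String[] args) {
        short oldVersion = 2;

        RequestData a = new RequestData();
        a.newIsrWithEpochs.addAll(List.of(1, 2, 3));
        System.out.println(buildDestructive(a, oldVersion).newIsr); // first attempt: [1, 2, 3]
        System.out.println(buildDestructive(a, oldVersion).newIsr); // retry: []

        RequestData b = new RequestData();
        b.newIsrWithEpochs.addAll(List.of(1, 2, 3));
        System.out.println(buildOnCopy(b, oldVersion).newIsr); // first attempt: [1, 2, 3]
        System.out.println(buildOnCopy(b, oldVersion).newIsr); // retry: [1, 2, 3]
    }
}
```

The second call to `buildDestructive` reads from a field that the first call already cleared, which matches the empty `Set()` seen in the log above. Building from a copy keeps the original data intact across retries.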