[ 
https://issues.apache.org/jira/browse/KAFKA-15353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754976#comment-17754976
 ] 

Luke Chen commented on KAFKA-15353:
-----------------------------------

I'm marking this issue as a blocker for v3.5.2 and v3.6.0. Let me know if you 
have any thoughts.

Sorry, I'll be attending a conference over the next few days, so I can't 
submit a patch for this issue soon. If anyone has spare cycles, feel free to 
take it over. Thanks.

cc [~calvinliu] [~junrao] [~dajac] 

 

 

> Empty ISR returned from controller after AlterPartition request
> ---------------------------------------------------------------
>
>                 Key: KAFKA-15353
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15353
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.5.0
>            Reporter: Luke Chen
>            Priority: Blocker
>             Fix For: 3.6.0, 3.5.2
>
>
> In 
> [KIP-903|https://cwiki.apache.org/confluence/display/KAFKA/KIP-903%3A+Replicas+with+stale+broker+epoch+should+not+be+allowed+to+join+the+ISR]
>  (more specifically this [PR|https://github.com/apache/kafka/pull/13408]), we 
> bumped the AlterPartitionRequest version to 3 to use the `NewIsrWithEpochs` 
> field instead of the `NewIsr` field. When building the request, we manually 
> convert/downgrade it to the older version for backward 
> compatibility 
> [here|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/requests/AlterPartitionRequest.java#L85-L96]:
>  we extract the ISR info from `NewIsrWithEpochs`, fill in the `NewIsr` 
> field, and then clear the `NewIsrWithEpochs` field.
>  
> The problem is that when the AlterPartitionRequest is sent for the first 
> time and hits a transient error (e.g. NOT_CONTROLLER), we retry. On the 
> retry, we build the AlterPartitionRequest again, but this time the request 
> data is the already-converted data from above. So when we try to extract the 
> ISR from `NewIsrWithEpochs`, we get an empty list, and we send out an 
> AlterPartition request with an empty ISR, which impacts Kafka availability. 
> From the log, I can see this:
> {code:java}
> [2023-08-16 03:57:55,122] INFO [Partition test_topic-1 broker=3] ISR updated 
> to  (under-min-isr) and version updated to 9 (kafka.cluster.Partition)
> {code}
>  
> This can happen when users upgrade from versions < 3.5.0 to 3.5.0 
> or later. During the rolling upgrade, some nodes are on v3.5.0 and 
> some are not, so a node on v3.5.0 will build an older version of 
> AlterPartitionRequest. If it then hits a transient error while sending 
> the AlterPartitionRequest, the ISR will be empty and no producers 
> will be able to write data to the partitions.
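
The retry behaviour described above can be reproduced with a small, self-contained sketch. Everything here is hypothetical: `PartitionData`, `BrokerState`, `downgradeInPlace`, and `downgradeIdempotent` are simplified stand-ins, not Kafka's real classes, and only the field names follow the issue text. The guarded variant shows one possible way to make the conversion safe to repeat; it is not necessarily the fix that was applied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the request data; NOT Kafka's real classes.
class AlterPartitionRetrySketch {

    static final class BrokerState {
        final int brokerId;
        final long brokerEpoch;

        BrokerState(int brokerId, long brokerEpoch) {
            this.brokerId = brokerId;
            this.brokerEpoch = brokerEpoch;
        }
    }

    static final class PartitionData {
        List<Integer> newIsr = new ArrayList<>();
        List<BrokerState> newIsrWithEpochs = new ArrayList<>();
    }

    // Mirrors the described behaviour: for versions < 3, derive newIsr from
    // newIsrWithEpochs and clear newIsrWithEpochs, mutating the data in place.
    static void downgradeInPlace(PartitionData data, short version) {
        if (version < 3) {
            List<Integer> isr = new ArrayList<>();
            for (BrokerState state : data.newIsrWithEpochs) {
                isr.add(state.brokerId);
            }
            data.newIsr = isr;                          // overwritten on every call
            data.newIsrWithEpochs = new ArrayList<>();  // source of truth is lost
        }
    }

    // Assumed guard: skip the conversion when it has already been applied,
    // so that building the request a second time leaves newIsr untouched.
    static void downgradeIdempotent(PartitionData data, short version) {
        if (version < 3 && !data.newIsrWithEpochs.isEmpty()) {
            downgradeInPlace(data, version);
        }
    }

    // Builds the request twice from the same data (first send, then retry)
    // and returns the ISR that the retry would carry.
    static List<Integer> isrAfterRetry(boolean guarded) {
        PartitionData data = new PartitionData();
        data.newIsrWithEpochs.add(new BrokerState(1, 100L));
        data.newIsrWithEpochs.add(new BrokerState(2, 101L));
        short version = 2; // the controller only supports the older version

        for (int attempt = 0; attempt < 2; attempt++) {
            if (guarded) {
                downgradeIdempotent(data, version);
            } else {
                downgradeInPlace(data, version);
            }
        }
        return data.newIsr;
    }

    public static void main(String[] args) {
        System.out.println("unguarded retry ISR: " + isrAfterRetry(false)); // []
        System.out.println("guarded retry ISR:   " + isrAfterRetry(true));  // [1, 2]
    }
}
```

In this sketch the unguarded path loses the ISR on the second build because the first build already cleared `newIsrWithEpochs`. Building each request from an untouched copy of the data would be another way to avoid the problem.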



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
