[ 
https://issues.apache.org/jira/browse/KAFKA-15353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chen updated KAFKA-15353:
------------------------------
    Description: 
In 
[KIP-903|https://cwiki.apache.org/confluence/display/KAFKA/KIP-903%3A+Replicas+with+stale+broker+epoch+should+not+be+allowed+to+join+the+ISR],
 (more specifically this [PR|https://github.com/apache/kafka/pull/13408]), we 
bumped the AlterPartitionRequest version to 3 to use the `NewIsrWithEpochs` 
field instead of the `NewIsr` field. When building the request for an older 
version, we manually convert (downgrade) the request data for backward 
compatibility 
[here|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/requests/AlterPartitionRequest.java#L85-L96]:
 we extract the ISR info from `NewIsrWithEpochs`, fill in the `NewIsr` 
field, and then clear the `NewIsrWithEpochs` field.

 

The problem is that if the first AlterPartitionRequest hits a transient error 
(e.g. NOT_CONTROLLER), we retry. On the retry, we build the 
AlterPartitionRequest again, but this time the request data is the data that 
was already converted during the first attempt. When we try to extract the ISR 
from `NewIsrWithEpochs`, we get an empty list, so we send out an AlterPartition 
request with an empty ISR, which impacts Kafka availability. From the log, I 
can see this:
{code:java}
[2023-08-16 03:57:55,122] INFO [Partition test_topic-1 broker=3] ISR updated to 
 (under-min-isr) and version updated to 9 (kafka.cluster.Partition)
{code}
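
The retry sequence described above can be reproduced in miniature. The following is only a sketch with hypothetical stand-in classes (`BrokerState`, `PartitionData`, `AlterPartitionSketch` and its methods are assumptions for illustration, not the real Kafka classes); `buildFromCopy` shows one possible remedy and is not necessarily the fix adopted in Kafka.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an ISR entry carrying a broker epoch.
final class BrokerState {
    final int brokerId;
    final long brokerEpoch;

    BrokerState(int brokerId, long brokerEpoch) {
        this.brokerId = brokerId;
        this.brokerEpoch = brokerEpoch;
    }
}

// Hypothetical, simplified model of the per-partition request data.
final class PartitionData {
    List<Integer> newIsr = new ArrayList<>();               // read by versions < 3
    List<BrokerState> newIsrWithEpochs = new ArrayList<>(); // read by version 3+
}

final class AlterPartitionSketch {
    // Mirrors the problematic pattern: the downgrade mutates the caller's data,
    // so a second build (the retry) finds newIsrWithEpochs already cleared.
    static PartitionData buildInPlace(PartitionData data, int version) {
        if (version < 3) {
            List<Integer> isr = new ArrayList<>();
            for (BrokerState state : data.newIsrWithEpochs) {
                isr.add(state.brokerId);
            }
            data.newIsr = isr;                         // overwritten with [] on retry
            data.newIsrWithEpochs = new ArrayList<>(); // source of truth is lost here
        }
        return data;
    }

    // One possible remedy (an assumption): convert into a fresh object and
    // leave the caller's data untouched, so every build sees the same input.
    static PartitionData buildFromCopy(PartitionData data, int version) {
        PartitionData out = new PartitionData();
        if (version < 3) {
            for (BrokerState state : data.newIsrWithEpochs) {
                out.newIsr.add(state.brokerId);
            }
        } else {
            out.newIsrWithEpochs = new ArrayList<>(data.newIsrWithEpochs);
        }
        return out;
    }
}
```

Calling `buildInPlace` twice on the same `PartitionData` with a version below 3 yields the correct ISR the first time and an empty ISR the second time, which matches the empty ISR in the log above. Calling `buildFromCopy` twice yields the same ISR both times.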
 

This happens when users upgrade from a version < 3.5.0 to 3.5.0 or later. 
During the rolling upgrade, some nodes are on v3.5.0 and some are not, so a 
node on v3.5.0 will try to build an older version of the 
AlterPartitionRequest. If that request then hits a transient error while being 
sent, the ISR will be empty and no producers will be able to write data to the 
partitions.

  was:
In 
[KIP-903|https://cwiki.apache.org/confluence/display/KAFKA/KIP-903%3A+Replicas+with+stale+broker+epoch+should+not+be+allowed+to+join+the+ISR],
 (more specifically this [PR|https://github.com/apache/kafka/pull/13408]), we 
bumped the AlterPartitionRequest version to 3 to use `NewIsrWithEpochs` field 
instead of `NewIsr` one. And when building the request, we'll manually 
convert/downgrade the request into the older version for backward compatibility 
[here|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/common/requests/AlterPartitionRequest.java#L85-L96],
 to extract ISR info from `NewIsrWithEpochs` and then fill in the `NewIsr` 
field, and then clear the `NewIsrWithEpochs` field.

 

The problem is, when the AlterPartitionRequest sent out for the first time, 
there's some transient error (ex: NOT_CONTROLLER), we'll retry. On the retry, 
we'll build the AlterPartitionRequest again. But this time, the request data is 
the one that converted above. At this point, when we try to extract the ISR 
from `NewIsrWithEpochs`, we'll get empty. So, we'll send out an AlterPartition 
request with empty ISR, and impacting the kafka availability. From the log, I 
can see this:
{code:java}
[2023-08-16 03:57:55,122] INFO [Partition test_topic-1 broker=3] ISR updated to 
 (under-min-isr) and version updated to 9 (kafka.cluster.Partition)
{code}
 

This will happen when users trying to upgrade from versions < 3.5.0 to 3.5.0 or 
later. During the rolling upgrade, there will be some nodes in v3.5.0, and some 
are not. So, for the node in v3.5.0 will try to build an old version of 
AlterPartitionRequest. And then, if it happen to have some transient error 
during the AlterPartitionRequest send, the ISR will be empty and no producers 
will be able to write data to the partitions.


> Empty ISR returned from controller after AlterPartition request
> ---------------------------------------------------------------
>
>                 Key: KAFKA-15353
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15353
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.5.0
>            Reporter: Luke Chen
>            Priority: Blocker
>             Fix For: 3.6.0, 3.5.2



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
