[jira] [Updated] (KAFKA-18930) KRaft MigrationEvent won't retry when failing to write data to ZK

Luke Chen (Jira) Thu, 06 Mar 2025 03:07:09 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Luke Chen updated KAFKA-18930:
------------------------------
    Description: 
During a ZK-to-KRaft migration, the cluster runs in dual-write mode. In that 
mode, metadata is first written to KRaft (and success is acked to the client), 
then written to ZK asynchronously. If the ZK write throws an exception, the 
KRaft MigrationEvent does not retry it. That causes metadata inconsistency 
between KRaft and ZK.

 

Ex: In dual-write mode, a client tries to create a topic:
 # The client sends a create-topic request to the KRaft controller.
 # The KRaft controller commits the request and creates the topic successfully.
 # The KRaft controller responds to the client with success.
 # KRaft writes the metadata delta to ZK, but the write fails with an exception.
 # Step (4) is not retried, so ZK is left with stale metadata.
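
To illustrate the missing retry in step (5), here is a minimal sketch of a retry loop with exponential backoff around a failing write. It is not the Kafka migration driver code: the class ZkWriteRetrySketch, the method retryWithBackoff, and its parameters are hypothetical names chosen for this example, and the "ZK write" is simulated by a lambda that fails twice before succeeding.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names exist in Kafka.
public class ZkWriteRetrySketch {

    /**
     * Runs op until it succeeds or maxAttempts is reached.
     * The delay between attempts starts at initialBackoffMs and doubles each time.
     * If every attempt fails, the last exception is rethrown.
     */
    public static <T> T retryWithBackoff(Callable<T> op, int maxAttempts, long initialBackoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait before the next attempt, then double the delay.
                    Thread.sleep(backoffMs);
                    backoffMs *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated ZK write: fails on the first two calls, succeeds on the third.
        String result = retryWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated ZK write failure");
            }
            return "written";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A bounded attempt count is used here only to keep the sketch finite; the note below argues for retrying until forced shutdown instead.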

 

Note:

1. When the KRaft controller shuts down cleanly, it should keep retrying the 
failed ZK write until shutdown is forced, so that the metadata ends up 
consistent.

2. During shutdown, [the order of 
shutdown|https://github.com/apache/kafka/blob/1ec1043d5197c4f807fa5cbc41d875b289443096/core/src/main/scala/kafka/server/ControllerServer.scala#L69-L76]
 is: close ZK -> close RPC client -> close migration driver. This causes 
another issue: even if we retry the ZK write, it can never succeed while 
shutdown is in progress, because the ZK connection is closed first.
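
The effect of the close order in note 2 can be shown with a small simulation. This is only a sketch under stated assumptions: FakeZkClient and FakeMigrationDriver are hypothetical stand-ins, not the real ControllerServer components, and it assumes the driver tries to flush its pending deltas when it is closed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; these classes are not Kafka's.
public class ShutdownOrderSketch {

    /** Stand-in for a ZK client: writes fail once it has been closed. */
    static class FakeZkClient {
        final List<String> written = new ArrayList<>();
        private boolean closed = false;

        void write(String delta) {
            if (closed) {
                throw new IllegalStateException("ZK connection closed");
            }
            written.add(delta);
        }

        void close() {
            closed = true;
        }
    }

    /** Stand-in for the migration driver: flushes pending deltas on close. */
    static class FakeMigrationDriver {
        private final FakeZkClient zk;
        private final Deque<String> pending = new ArrayDeque<>();

        FakeMigrationDriver(FakeZkClient zk) {
            this.zk = zk;
        }

        void enqueue(String delta) {
            pending.add(delta);
        }

        /** Attempts to write every pending delta; returns how many were lost. */
        int close() {
            int lost = 0;
            while (!pending.isEmpty()) {
                String delta = pending.poll();
                try {
                    zk.write(delta);
                } catch (IllegalStateException e) {
                    lost++;
                }
            }
            return lost;
        }
    }

    /** Returns how many pending deltas are lost for the given close order. */
    public static int lostDeltas(boolean closeZkFirst) {
        FakeZkClient zk = new FakeZkClient();
        FakeMigrationDriver driver = new FakeMigrationDriver(zk);
        driver.enqueue("create-topic-delta");
        if (closeZkFirst) {
            zk.close();            // order described in this issue
            return driver.close(); // flush can no longer reach ZK
        }
        int lost = driver.close(); // flush while ZK is still open
        zk.close();
        return lost;
    }

    public static void main(String[] args) {
        System.out.println("ZK closed first: lost " + lostDeltas(true));
        System.out.println("driver closed first: lost " + lostDeltas(false));
    }
}
```

In this model, closing the driver before the ZK client is what gives a retried write a chance to land; whether that reordering is the right fix for the real controller is a separate question.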

 

The impact: if the cluster is rolled back to ZK mode during migration, the 
metadata in ZK is out of date.

  was:
When running ZK migrating to KRaft, there will be a dual-write mode. In that 
mode, metadata will write to KRaft, then write to ZK asynchronously. When 
there's some exception, KRaft MigrationEvent won't retry when failing to write 
data to ZK. That causes metadata inconsistency between KRaft and ZK.

 

Note:

1. Besides, when doing KRaft controller clean shutdown, we should keep retrying 
the failing ZK writing until force shutdown, to make sure the metadata is 
consistent.

2.  When doing shutdown, [the order of 
shutdown|https://github.com/apache/kafka/blob/1ec1043d5197c4f807fa5cbc41d875b289443096/core/src/main/scala/kafka/server/ControllerServer.scala#L69-L76]
 is to close ZK -> close RPC Client -> close migration driver. That causes 
another issue that even if we retry the ZK write, it will never succeed when 
shutdown is ongoing because ZK connection is closed first.

 

The impact is when rolling back to ZK mode during migration, the metadata in ZK 
is out of date


> KRaft MigrationEvent won't retry when failing to write data to ZK 
> ------------------------------------------------------------------
>
>                 Key: KAFKA-18930
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18930
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.9.0
>            Reporter: Luke Chen
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
