[ https://issues.apache.org/jira/browse/KAFKA-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810583#comment-17810583 ]
Gaurav Narula commented on KAFKA-16157:
---------------------------------------

Here are some notes from my troubleshooting:

When {{d1}} fails in {{broker-1}}, the controller updates the {{BrokerRegistration}} so that it only contains one log directory, {{d2}}. On handling the topic recreation, the controller makes use of an optimisation to assign the replica to the only remaining log directory {{d2}} [[0]|https://github.com/apache/kafka/blob/f1924353126fdf6aad2ba1f8d0c22dade59360b1/metadata/src/main/java/org/apache/kafka/controller/ClusterControlManager.java#L721].

In {{broker-1}}, {{ReplicaManager::getOrCreatePartition}} is invoked when the topic is recreated, and we branch into the case handling {{HostedPartition.Offline}} [[1]|https://github.com/apache/kafka/blob/f1924353126fdf6aad2ba1f8d0c22dade59360b1/core/src/main/scala/kafka/server/ReplicaManager.scala#L2774]. Since {{ReplicaManager::allPartitions}} is keyed by {{TopicPartition}}, it doesn't track the {{TopicId}} of the replica in the offline log directory. We therefore return {{None}} and do not handle the topic delta correctly. The first part of the fix would therefore be to modify {{HostedPartition.Offline}} so that it also tracks the topic id (see the sketch at the end of this message).

The next part is to handle the log creation correctly. The broker eventually invokes {{Partition::createLogIfNotExists}} [[2]|https://github.com/apache/kafka/blob/f1924353126fdf6aad2ba1f8d0c22dade59360b1/core/src/main/scala/kafka/cluster/Partition.scala#L877] with {{isNew = directoryId == DirectoryId.UNASSIGNED}}. Recall that, because of the optimisation in the controller, the directory id will *not* be {{UNASSIGNED}} but will point to the UUID for {{d2}}, and therefore {{isNew = false}}. Eventually, {{LogManager::getOrCreateLog}} fails when {{isNew = false && offlineLogDirs.nonEmpty}} [[3]|https://github.com/apache/kafka/blob/f1924353126fdf6aad2ba1f8d0c22dade59360b1/core/src/main/scala/kafka/log/LogManager.scala#L1009]. The second part of the fix is therefore to update {{Partition::createLogInAssignedDirectoryId}} to invoke {{Partition::createLogIfNotExists}} correctly.

CC: [~omnia_h_ibrahim] [~soarez] [~cmccabe] [~pprovenzano]

> Topic recreation with offline disk doesn't update leadership/shrink ISR correctly
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-16157
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16157
>             Project: Kafka
>          Issue Type: Bug
>          Components: jbod, kraft
>    Affects Versions: 3.7.0
>            Reporter: Gaurav Narula
>            Priority: Blocker
>             Fix For: 3.7.0
>
>         Attachments: broker.log, broker.log.1, broker.log.10, broker.log.2, broker.log.3, broker.log.4, broker.log.5, broker.log.6, broker.log.7, broker.log.8, broker.log.9
>
>
> In a cluster with 4 brokers, `broker-1..broker-4`, with 2 disks, `d1` and `d2`, in each broker, we perform the following operations:
>
> # Create a topic `foo.test` with 10 partitions and RF 4. Let's assume the topic was created with id `rAujIqcjRbu_-E4UxgQT8Q`.
> # Start a producer in the background to produce to `foo.test`.
> # Break disk `d1` in `broker-1`. We simulate this by marking the log dir read-only.
> # Delete topic `foo.test`.
> # Recreate topic `foo.test`. Let's assume the topic was created with id `bgdrsv-1QjCLFEqLOzVCHg`.
> # Wait for 5 minutes.
> # Describe the recreated topic `foo.test`.
> We observe that `broker-1` is the leader and in-sync for a few partitions:
>
> {code:java}
> Topic: foo.test    TopicId: bgdrsv-1QjCLFEqLOzVCHg    PartitionCount: 10    ReplicationFactor: 4    Configs: min.insync.replicas=1,unclean.leader.election.enable=false
>     Topic: foo.test    Partition: 0    Leader: 101    Replicas: 101,102,103,104    Isr: 101,102,103,104
>     Topic: foo.test    Partition: 1    Leader: 102    Replicas: 102,103,104,101    Isr: 102,103,104
>     Topic: foo.test    Partition: 2    Leader: 103    Replicas: 103,104,101,102    Isr: 103,104,102
>     Topic: foo.test    Partition: 3    Leader: 104    Replicas: 104,101,102,103    Isr: 104,102,103
>     Topic: foo.test    Partition: 4    Leader: 104    Replicas: 104,102,101,103    Isr: 104,102,103
>     Topic: foo.test    Partition: 5    Leader: 102    Replicas: 102,101,103,104    Isr: 102,103,104
>     Topic: foo.test    Partition: 6    Leader: 101    Replicas: 101,103,104,102    Isr: 101,103,104,102
>     Topic: foo.test    Partition: 7    Leader: 103    Replicas: 103,104,102,101    Isr: 103,104,102
>     Topic: foo.test    Partition: 8    Leader: 101    Replicas: 101,102,104,103    Isr: 101,102,104,103
>     Topic: foo.test    Partition: 9    Leader: 102    Replicas: 102,104,103,101    Isr: 102,104,103
> {code}
>
> In this example, it is the leader of partitions `0, 6 and 8`.
>
> Consider `foo.test-8`. It is present on the following brokers/disks:
>
> {code:java}
> $ fd foo.test-8
> broker-1/d1/foo.test-8/
> broker-2/d2/foo.test-8/
> broker-3/d2/foo.test-8/
> broker-4/d1/foo.test-8/{code}
>
> `broker-1/d1` still refers to the topic id which is pending deletion, because the log dir is marked offline.
>
> {code:java}
> $ cat broker-1/d1/foo.test-8/partition.metadata
> version: 0
> topic_id: rAujIqcjRbu_-E4UxgQT8Q{code}
>
> However, the other brokers have the correct topic id:
>
> {code:java}
> $ cat broker-2/d2/foo.test-8/partition.metadata
> version: 0
> topic_id: bgdrsv-1QjCLFEqLOzVCHg%{code}
>
> Now, let's consider `foo.test-0`. We observe that the replica isn't present on `broker-1`:
>
> {code:java}
> $ fd foo.test-0
> broker-2/d1/foo.test-0/
> broker-3/d1/foo.test-0/
> broker-4/d2/foo.test-0/{code}
>
> In both cases, `broker-1` shouldn't be the leader or an in-sync replica for these partitions.
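To make the first part of the proposed fix more concrete, here is a minimal, self-contained sketch of the idea. It uses simplified stand-in types rather than the real {{ReplicaManager}}/{{Partition}} classes, and the {{Uuid}}, {{TopicPartition}} and {{isStaleOfflineReplica}} names are illustrative only: if {{HostedPartition.Offline}} carries the topic id of the replica stranded in the offline log directory, the broker can tell a recreated topic (same name, new id) apart from the incarnation that is pending deletion.

{code:scala}
// Minimal sketch only -- simplified stand-ins for the real Kafka types,
// not the actual ReplicaManager code.
object OfflineTopicIdSketch {

  final case class Uuid(value: String)                            // stand-in for org.apache.kafka.common.Uuid
  final case class TopicPartition(topic: String, partition: Int)  // stand-in for the Kafka class

  sealed trait HostedPartition
  object HostedPartition {
    // No replica hosted for this topic-partition.
    case object None extends HostedPartition
    // Replica lives in a healthy log directory.
    final case class Online(topicId: Uuid) extends HostedPartition
    // Replica is stranded in an offline log directory. Today this case carries no
    // topic id, so a recreated topic (same name, new id) is indistinguishable from
    // the deleted one; this sketch adds the id of the stranded replica.
    final case class Offline(topicId: Option[Uuid]) extends HostedPartition
  }

  // Hypothetical helper: does the incoming topic delta refer to a different
  // incarnation of the topic than the replica stuck on the offline disk?
  def isStaleOfflineReplica(hosted: HostedPartition, incomingTopicId: Uuid): Boolean =
    hosted match {
      case HostedPartition.Offline(Some(strandedId)) => strandedId != incomingTopicId
      case _                                         => false
    }

  def main(args: Array[String]): Unit = {
    val tp          = TopicPartition("foo.test", 8)
    val deletedId   = Uuid("rAujIqcjRbu_-E4UxgQT8Q") // id of the deleted topic on the offline dir
    val recreatedId = Uuid("bgdrsv-1QjCLFEqLOzVCHg") // id assigned on recreation

    // Keyed by TopicPartition, as in ReplicaManager::allPartitions, but the Offline
    // entry now remembers which topic id it belongs to.
    val allPartitions = Map(tp -> HostedPartition.Offline(Some(deletedId)))

    // The broker can now see that the offline replica belongs to the old incarnation
    // and handle the topic delta for the recreated topic instead of returning None.
    println(isStaleOfflineReplica(allPartitions(tp), recreatedId)) // prints: true
  }
}
{code}

The same information would also feed the second part of the fix: when the stranded replica belongs to the old topic id, {{Partition::createLogIfNotExists}} presumably needs to be called with {{isNew = true}} even though the controller has already assigned a directory id for the new replica.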