[ https://issues.apache.org/jira/browse/KAFKA-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810583#comment-17810583 ]
Gaurav Narula commented on KAFKA-16157:
---------------------------------------

Here are some notes from my troubleshooting:

When {{d1}} fails in {{broker-1}}, the controller updates the {{BrokerRegistration}} so that it only contains one log directory, {{d2}}. On handling the topic recreation, the controller makes use of an optimisation to assign the replica to the only remaining log directory {{d2}} [[0]|https://github.com/apache/kafka/blob/f1924353126fdf6aad2ba1f8d0c22dade59360b1/metadata/src/main/java/org/apache/kafka/controller/ClusterControlManager.java#L721].

In {{broker-1}}, {{ReplicaManager::getOrCreatePartition}} is invoked when the topic is recreated, and we branch into the case handling {{HostedPartition.Offline}} [[1]|https://github.com/apache/kafka/blob/f1924353126fdf6aad2ba1f8d0c22dade59360b1/core/src/main/scala/kafka/server/ReplicaManager.scala#L2774]. Since {{ReplicaManager::allPartitions}} is keyed by {{TopicPartition}}, it doesn't track the {{TopicId}} of the replica in the offline log directory. We therefore return {{None}} and do not handle the topic delta correctly. The first part of the fix would therefore be to modify {{HostedPartition.Offline}} so that it also tracks the topic id (see the sketch at the end of this message).

The next part is to handle the log creation correctly. The broker eventually invokes {{Partition::createLogIfNotExists}} [[2]|https://github.com/apache/kafka/blob/f1924353126fdf6aad2ba1f8d0c22dade59360b1/core/src/main/scala/kafka/cluster/Partition.scala#L877] with {{isNew = directoryId == DirectoryId.UNASSIGNED}}. Recall that, because of the optimisation in the controller, the directory id will *not* be {{UNASSIGNED}} but will point to the UUID for {{d2}}, and therefore {{isNew = false}}. Eventually, {{LogManager::getOrCreateLog}} fails when {{isNew = false && offlineLogDirs.nonEmpty}} [[3]|https://github.com/apache/kafka/blob/f1924353126fdf6aad2ba1f8d0c22dade59360b1/core/src/main/scala/kafka/log/LogManager.scala#L1009]. The second part of the fix is therefore to update {{Partition::createLogInAssignedDirectoryId}} to invoke {{Partition::createLogIfNotExists}} correctly.

CC: [~omnia_h_ibrahim] [~soarez] [~cmccabe] [~pprovenzano]

> Topic recreation with offline disk doesn't update leadership/shrink ISR correctly
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-16157
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16157
>             Project: Kafka
>          Issue Type: Bug
>          Components: jbod, kraft
>    Affects Versions: 3.7.0
>            Reporter: Gaurav Narula
>            Priority: Blocker
>             Fix For: 3.7.0
>
>         Attachments: broker.log, broker.log.1, broker.log.10, broker.log.2, broker.log.3, broker.log.4, broker.log.5, broker.log.6, broker.log.7, broker.log.8, broker.log.9
>
>
> In a cluster with 4 brokers, `broker-1..broker-4`, with 2 disks, `d1` and `d2`, in each broker, we perform the following operations:
>
> # Create a topic `foo.test` with 10 partitions and RF 4. Let's assume the topic was created with id `rAujIqcjRbu_-E4UxgQT8Q`.
> # Start a producer in the background to produce to `foo.test`.
> # Break disk `d1` in `broker-1`. We simulate this by marking the log dir read-only.
> # Delete topic `foo.test`.
> # Recreate topic `foo.test`. Let's assume the topic was created with id `bgdrsv-1QjCLFEqLOzVCHg`.
> # Wait for 5 minutes.
> # Describe the recreated topic `foo.test`.
> We observe that `broker-1` is the leader and in-sync for a few partitions:
>
> {code:java}
> Topic: foo.test    TopicId: bgdrsv-1QjCLFEqLOzVCHg    PartitionCount: 10    ReplicationFactor: 4    Configs: min.insync.replicas=1,unclean.leader.election.enable=false
>     Topic: foo.test    Partition: 0    Leader: 101    Replicas: 101,102,103,104    Isr: 101,102,103,104
>     Topic: foo.test    Partition: 1    Leader: 102    Replicas: 102,103,104,101    Isr: 102,103,104
>     Topic: foo.test    Partition: 2    Leader: 103    Replicas: 103,104,101,102    Isr: 103,104,102
>     Topic: foo.test    Partition: 3    Leader: 104    Replicas: 104,101,102,103    Isr: 104,102,103
>     Topic: foo.test    Partition: 4    Leader: 104    Replicas: 104,102,101,103    Isr: 104,102,103
>     Topic: foo.test    Partition: 5    Leader: 102    Replicas: 102,101,103,104    Isr: 102,103,104
>     Topic: foo.test    Partition: 6    Leader: 101    Replicas: 101,103,104,102    Isr: 101,103,104,102
>     Topic: foo.test    Partition: 7    Leader: 103    Replicas: 103,104,102,101    Isr: 103,104,102
>     Topic: foo.test    Partition: 8    Leader: 101    Replicas: 101,102,104,103    Isr: 101,102,104,103
>     Topic: foo.test    Partition: 9    Leader: 102    Replicas: 102,104,103,101    Isr: 102,104,103
> {code}
>
> In this example, it is the leader of partitions `0, 6 and 8`.
>
> Consider `foo.test-8`. It is present on the following brokers/disks:
>
> {code:java}
> $ fd foo.test-8
> broker-1/d1/foo.test-8/
> broker-2/d2/foo.test-8/
> broker-3/d2/foo.test-8/
> broker-4/d1/foo.test-8/{code}
>
> `broker-1/d1` still refers to the topic id which is pending deletion, because the log dir is marked offline.
>
> {code:java}
> $ cat broker-1/d1/foo.test-8/partition.metadata
> version: 0
> topic_id: rAujIqcjRbu_-E4UxgQT8Q{code}
>
> However, the other brokers have the correct topic id:
>
> {code:java}
> $ cat broker-2/d2/foo.test-8/partition.metadata
> version: 0
> topic_id: bgdrsv-1QjCLFEqLOzVCHg%{code}
>
> Now, let's consider `foo.test-0`. We observe that the replica isn't present on `broker-1`:
>
> {code:java}
> $ fd foo.test-0
> broker-2/d1/foo.test-0/
> broker-3/d1/foo.test-0/
> broker-4/d2/foo.test-0/{code}
>
> In both cases, `broker-1` shouldn't be the leader or an in-sync replica for these partitions.
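To make the first part of the proposed fix more concrete, here is a minimal, self-contained sketch of the idea. It uses simplified stand-in types rather than the real {{ReplicaManager}}/{{Partition}} classes, and the {{Uuid}}, {{TopicPartition}} and {{isStaleOfflineReplica}} names are illustrative only: if {{HostedPartition.Offline}} carries the topic id of the replica stranded in the offline log directory, the broker can tell a recreated topic (same name, new id) apart from the incarnation that is pending deletion.

{code:scala}
// Minimal sketch only -- simplified stand-ins for the real Kafka types,
// not the actual ReplicaManager code.
object OfflineTopicIdSketch {

  final case class Uuid(value: String)                            // stand-in for org.apache.kafka.common.Uuid
  final case class TopicPartition(topic: String, partition: Int)  // stand-in for the Kafka class

  sealed trait HostedPartition
  object HostedPartition {
    // No replica hosted for this topic-partition.
    case object None extends HostedPartition
    // Replica lives in a healthy log directory.
    final case class Online(topicId: Uuid) extends HostedPartition
    // Replica is stranded in an offline log directory. Today this case carries no
    // topic id, so a recreated topic (same name, new id) is indistinguishable from
    // the deleted one; this sketch adds the id of the stranded replica.
    final case class Offline(topicId: Option[Uuid]) extends HostedPartition
  }

  // Hypothetical helper: does the incoming topic delta refer to a different
  // incarnation of the topic than the replica stuck on the offline disk?
  def isStaleOfflineReplica(hosted: HostedPartition, incomingTopicId: Uuid): Boolean =
    hosted match {
      case HostedPartition.Offline(Some(strandedId)) => strandedId != incomingTopicId
      case _                                         => false
    }

  def main(args: Array[String]): Unit = {
    val tp          = TopicPartition("foo.test", 8)
    val deletedId   = Uuid("rAujIqcjRbu_-E4UxgQT8Q") // id of the deleted topic on the offline dir
    val recreatedId = Uuid("bgdrsv-1QjCLFEqLOzVCHg") // id assigned on recreation

    // Keyed by TopicPartition, as in ReplicaManager::allPartitions, but the Offline
    // entry now remembers which topic id it belongs to.
    val allPartitions = Map(tp -> HostedPartition.Offline(Some(deletedId)))

    // The broker can now see that the offline replica belongs to the old incarnation
    // and handle the topic delta for the recreated topic instead of returning None.
    println(isStaleOfflineReplica(allPartitions(tp), recreatedId)) // prints: true
  }
}
{code}

The same information would also feed the second part of the fix: when the stranded replica belongs to the old topic id, {{Partition::createLogIfNotExists}} presumably needs to be called with {{isNew = true}} even though the controller has already assigned a directory id for the new replica.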