[ 
https://issues.apache.org/jira/browse/KAFKA-16814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849919#comment-17849919
 ] 

Luke Chen commented on KAFKA-16814:
-----------------------------------

[~brandboat], thanks for the help! I'd like to get this fix into v3.7.1 and 
v3.8.0, because the impact of this issue is that the broker cannot start up 
at all. Let me know if you run into any problems.

> KRaft broker cannot startup when `partition.metadata` is missing
> ----------------------------------------------------------------
>
>                 Key: KAFKA-16814
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16814
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.7.0
>            Reporter: Luke Chen
>            Assignee: Kuan Po Tseng
>            Priority: Major
>             Fix For: 3.8.0, 3.7.1
>
>
> When starting the Kafka LogManager, we check for stray replicas to avoid 
> some corner cases. But this check can leave the broker unable to start up if 
> `partition.metadata` is missing: on startup we load each log from file, and 
> the log's topicId comes from the `partition.metadata` file. 
> So, if `partition.metadata` is missing, the topicId will be None, and 
> `LogManager#isStrayKraftReplica` will fail with a "no topic ID" error.
> The `partition.metadata` file could be missing because of a storage failure; 
> another possible path is an unclean shutdown after the topic is created on 
> the replica but before the data is flushed to the `partition.metadata` file. 
> This is possible because we do the flush asynchronously 
> [here|https://github.com/apache/kafka/blob/5552f5c26df4eb07b2d6ee218e4a29e4ca790d5c/core/src/main/scala/kafka/log/UnifiedLog.scala#L229].
>  
>  
> {code:java}
> ERROR Encountered fatal fault: Error starting LogManager (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
> java.lang.RuntimeException: The log dir Log(dir=/tmp/kraft-broker-logs/quickstart-events-0, topic=quickstart-events, partition=0, highWatermark=0, lastStableOffset=0, logStartOffset=0, logEndOffset=0) does not have a topic ID, which is not allowed when running in KRaft mode.
>     at kafka.log.LogManager$.$anonfun$isStrayKraftReplica$1(LogManager.scala:1609)
>     at scala.Option.getOrElse(Option.scala:201)
>     at kafka.log.LogManager$.isStrayKraftReplica(LogManager.scala:1608)
>     at kafka.server.metadata.BrokerMetadataPublisher.$anonfun$initializeManagers$1(BrokerMetadataPublisher.scala:294)
>     at kafka.server.metadata.BrokerMetadataPublisher.$anonfun$initializeManagers$1$adapted(BrokerMetadataPublisher.scala:294)
>     at kafka.log.LogManager.loadLog(LogManager.scala:359)
>     at kafka.log.LogManager.$anonfun$loadLogs$15(LogManager.scala:493)
>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>     at java.base/java.lang.Thread.run(Thread.java:1623)
> {code}
>  
> If we don't do the `isStrayKraftReplica` check, the topicID and 
> `partition.metadata` will be recovered later, once the broker receives the 
> topic partition update and becomes leader or follower. I'm proposing we skip 
> the `isStrayKraftReplica` check if topicID is None, instead of throwing an 
> exception that terminates Kafka. The `isStrayKraftReplica` check only covers 
> a corner case, so this should be fine IMO.
>  
>  
> === update ===
> Checked KAFKA-14616 and KAFKA-15605: the purpose of finding stray replicas 
> and deleting them is to clean up replicas that should have been deleted but 
> were left in the log dir. So, if we have a replica that doesn't have a 
> topicID (because `partition.metadata` is missing), we cannot tell whether it 
> is a stray replica or not. In this case, we can either:
>  # Delete it
>  # Ignore it
> For (1), the impact is that, if this is not a stray replica and the 
> replication factor is only 1, the data might be moved to another 
> "xxx-stray" dir, and the partition becomes empty.
> For (2), the impact is that, if this is a stray replica and we don't delete 
> it, the partition dir might not be created, as in KAFKA-15605 or KAFKA-14616.
> Per the investigation above, the missing `partition.metadata` issue is mostly 
> caused by the async write of `partition.metadata` when creating a topic. 
> Later, before any data is appended to the log, we make sure the partition 
> metadata file is written to the log dir 
> [here|https://github.com/apache/kafka/blob/5552f5c26df4eb07b2d6ee218e4a29e4ca790d5c/core/src/main/scala/kafka/log/UnifiedLog.scala#L772-L774].
>  So, it should be fine to delete it, since the topic should be empty.
> In short, when we find a log without a topicID, we should treat it as a stray 
> log and delete it.
>  
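
The conclusion of the quoted description (a log without a topic ID is treated as stray rather than causing a fatal error) can be illustrated with a minimal sketch. This is not Kafka's actual `LogManager#isStrayKraftReplica` code: the class `StrayLogSketch`, the method `isStray`, and the `assigned` map (topic ID to the partitions this broker is expected to host) are hypothetical names introduced only for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch; names and types are assumptions, not Kafka's API.
public class StrayLogSketch {

    /**
     * Decide whether a log found on disk should be treated as stray.
     *
     * @param topicId   topic ID read from partition.metadata, empty if the file is missing
     * @param partition partition number of the log directory
     * @param assigned  assumed view of the metadata: topic ID -> partitions hosted by this broker
     */
    static boolean isStray(Optional<UUID> topicId, int partition,
                           Map<UUID, Set<Integer>> assigned) {
        // Without a topic ID the log cannot be matched against the metadata,
        // so treat it as stray instead of throwing and terminating the broker.
        if (!topicId.isPresent()) {
            return true;
        }
        Set<Integer> partitions = assigned.get(topicId.get());
        // Stray if the topic is unknown or this partition is not assigned here.
        return partitions == null || !partitions.contains(partition);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        Map<UUID, Set<Integer>> assigned =
            Collections.singletonMap(id, Collections.singleton(0));

        System.out.println(isStray(Optional.of(id), 0, assigned));  // false: assigned replica
        System.out.println(isStray(Optional.of(id), 1, assigned));  // true: partition not assigned
        System.out.println(isStray(Optional.empty(), 0, assigned)); // true: no topic ID
    }
}
```

The trade-off is the one described under option (1): a non-stray log with no topic ID would also be removed, which is considered acceptable because such a log should not contain any appended data yet.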



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
