[ 
https://issues.apache.org/jira/browse/KAFKA-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757413#comment-17757413
 ] 

Divij Vaidya commented on KAFKA-15391:
--------------------------------------

[~ocadaruma] This case is slightly different in that the race condition is not 
with the retention threads but between the async segment roll and the request 
handler thread. The fix is the same, though: we shouldn't be erroring out if 
the file that we want to delete doesn't exist.
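
To illustrate the idea, here is a minimal Java sketch of a flush that tolerates a missing directory, applied to the failing flush described in steps 5 and 6 of the issue below. It is not the actual patch: the class name DirFlushSketch, the helper flushDirIfExists, and the ".delete-sketch" suffix are hypothetical names used only for illustration. It also assumes a POSIX-like filesystem where a directory can be opened read-only and fsynced.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirFlushSketch {

    // Strict flush: fsync the directory entry itself.
    // Throws NoSuchFileException if the directory has been renamed or removed.
    static void flushDir(Path dir) throws IOException {
        try (FileChannel channel = FileChannel.open(dir, StandardOpenOption.READ)) {
            channel.force(true);
        }
    }

    // Tolerant flush (hypothetical helper): a missing directory is treated as
    // "nothing left to flush" rather than as a storage failure.
    // Returns true if the directory was flushed, false if it no longer exists.
    static boolean flushDirIfExists(Path dir) throws IOException {
        try {
            flushDir(dir);
            return true;
        } catch (NoSuchFileException e) {
            return false;
        }
    }

    // Simulates the race: the partition directory is renamed (as topic deletion
    // would do) before the previously scheduled flush runs against the old path.
    static boolean simulateRace() {
        try {
            Path logDir = Files.createTempDirectory("kafka-sketch");
            Path partitionDir = Files.createDirectory(logDir.resolve("test-0"));
            Path renamed = logDir.resolve("test-0.delete-sketch"); // hypothetical suffix
            Files.move(partitionDir, renamed);

            boolean strictFailed = false;
            try {
                flushDir(partitionDir); // stale path captured before the rename
            } catch (NoSuchFileException e) {
                strictFailed = true;
            }
            boolean tolerantFlushed = flushDirIfExists(partitionDir);

            // Clean up the temporary directories.
            Files.delete(renamed);
            Files.delete(logDir);

            // Expected: the strict flush fails, the tolerant one skips quietly.
            return strictFailed && !tolerantFlushed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("race tolerated: " + simulateRace());
    }
}
```

Only NoSuchFileException is swallowed here; any other IOException still propagates, so genuine disk failures would continue to be reported.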

> Delete topic may lead to directory offline
> ------------------------------------------
>
>                 Key: KAFKA-15391
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15391
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Divij Vaidya
>            Priority: Major
>             Fix For: 3.6.0
>
>
> This is an edge case where the entire log directory is marked offline when we 
> delete a topic. The symptoms of this scenario are characterised by the 
> following logs:
> {noformat}
> [2023-08-14 09:22:12,600] ERROR Uncaught exception in scheduled task 
> 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152)  
> org.apache.kafka.common.errors.KafkaStorageException: Error while flushing 
> log for test-0 in dir /tmp/kafka-15093588566723278510 with offset 221 
> (exclusive) and recovery point 221 Caused by: 
> java.nio.file.NoSuchFileException: 
> /tmp/kafka-15093588566723278510/test-0{noformat}
> The above log is followed by logs such as:
> {noformat}
> [2023-08-14 09:22:12,601] ERROR Uncaught exception in scheduled task 
> 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152) 
> org.apache.kafka.common.errors.KafkaStorageException: The log dir 
> /tmp/kafka-15093588566723278510 is already offline due to a previous IO 
> exception.{noformat}
> The sequence of events below demonstrates the scenario where this bug manifests:
> 1. On the broker, the partition lock is acquired and UnifiedLog.roll() is 
> called, which schedules an async call to flushUptoOffsetExclusive(). The roll 
> may be triggered by the segment rotation time or size.
> 2. Admin client calls deleteTopic
> 3. On the broker, LogManager.asyncDelete() is called which will call 
> UnifiedLog.renameDir()
> 4. The directory for the partition is successfully renamed with a "delete" 
> suffix.
> 5. The async task scheduled in step 1 (flushUptoOffsetExclusive) starts 
> executing. It tries to call localLog.flush() without acquiring a partition 
> lock. 
> 6. LocalLog calls Utils.flushDir() which fails with an IOException.
> 7. On IOException, log directory is added to logDirFailureChannel
> 8. Any new interaction with this logDir fails and a log line is printed such 
> as 
> "The log dir $logDir is already offline due to a previous IO exception"
>  
> This is the reason DeleteTopicTest is flaky as well - 
> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe/Berlin&tests.container=kafka.admin.DeleteTopicTest&tests.test=testDeleteTopicWithCleaner()



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
