[ 
https://issues.apache.org/jira/browse/KAFKA-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Divij Vaidya resolved KAFKA-15391.
----------------------------------
      Reviewer: Divij Vaidya
    Resolution: Fixed

> Delete topic may lead to directory offline
> ------------------------------------------
>
>                 Key: KAFKA-15391
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15391
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Divij Vaidya
>            Assignee: Haruki Okada
>            Priority: Major
>             Fix For: 3.6.0, 3.5.2
>
>
> This is an edge case where the entire log directory is marked offline when we 
> delete a topic. The symptoms of this scenario are characterised by the 
> following logs:
> {noformat}
> [2023-08-14 09:22:12,600] ERROR Uncaught exception in scheduled task 
> 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152)  
> org.apache.kafka.common.errors.KafkaStorageException: Error while flushing 
> log for test-0 in dir /tmp/kafka-15093588566723278510 with offset 221 
> (exclusive) and recovery point 221 Caused by: 
> java.nio.file.NoSuchFileException: 
> /tmp/kafka-15093588566723278510/test-0{noformat}
> The above log is followed by logs such as:
> {noformat}
> [2023-08-14 09:22:12,601] ERROR Uncaught exception in scheduled task 
> 'flush-log' 
> (org.apache.kafka.server.util.KafkaScheduler:152) 
> org.apache.kafka.common.errors.KafkaStorageException: The log dir 
> /tmp/kafka-15093588566723278510 is already offline due to a 
> previous IO exception.{noformat}
> The following sequence of events demonstrates the scenario in which this bug 
> manifests:
> 1. On the broker, the partition lock is acquired and UnifiedLog.roll() is 
> called, which schedules an async call to flushUptoOffsetExclusive(). The roll 
> may be triggered by segment rotation time or size.
> 2. The admin client calls deleteTopic.
> 3. On the broker, LogManager.asyncDelete() is called, which calls 
> UnifiedLog.renameDir().
> 4. The directory for the partition is successfully renamed with a "delete" 
> suffix.
> 5. The async task scheduled in step 1 (flushUptoOffsetExclusive) starts 
> executing. It tries to call localLog.flush() without acquiring a partition 
> lock. 
> 6. LocalLog calls Utils.flushDir() which fails with an IOException.
> 7. On IOException, the log directory is added to logDirFailureChannel.
> 8. Any new interaction with this logDir fails, and a log line such as the 
> following is printed: 
> "The log dir $logDir is already offline due to a previous IO exception"
>  
> This is also the reason DeleteTopicTest is flaky: 
> https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe/Berlin&tests.container=kafka.admin.DeleteTopicTest&tests.test=testDeleteTopicWithCleaner()
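
The failure in steps 3 to 6 above can be reproduced outside Kafka with a minimal Java sketch. This is not Kafka code: the class RenameFlushRace, its methods, and the directory name test-0.sketch-delete are hypothetical, and flushDir below is an assumed stand-in that opens the directory and fsyncs it. The sketch runs the rename and the flush sequentially rather than on separate threads, so the problematic ordering is forced deterministically.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RenameFlushRace {

    // Hypothetical stand-in for a directory flush: open the directory and fsync it.
    static void flushDir(Path dir) throws IOException {
        try (FileChannel channel = FileChannel.open(dir, StandardOpenOption.READ)) {
            channel.force(true);
        }
    }

    // Renames the partition directory first (steps 3-4), then flushes the
    // old path (steps 5-6). Returns true if the flush fails with
    // NoSuchFileException, the exception seen in the logs above.
    static boolean flushFailsAfterRename() {
        try {
            Path logDir = Files.createTempDirectory("kafka-sketch");
            Path partitionDir = Files.createDirectory(logDir.resolve("test-0"));
            Path renamedDir = logDir.resolve("test-0.sketch-delete");

            // The delete path renames the directory while a flush is still pending.
            Files.move(partitionDir, renamedDir);
            try {
                // The pending flush still holds the old path, which no longer exists.
                flushDir(partitionDir);
                return false;
            } catch (NoSuchFileException e) {
                return true;
            } finally {
                // Clean up the temporary directories created by this sketch.
                Files.deleteIfExists(renamedDir);
                Files.deleteIfExists(logDir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("flush failed after rename: " + flushFailsAfterRename());
    }
}
```

In the broker, the equivalent IOException is not confined to the deleted partition: per steps 7 and 8, it takes the whole log directory offline.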



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
