[ https://issues.apache.org/jira/browse/KAFKA-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Divij Vaidya resolved KAFKA-15391. ---------------------------------- Reviewer: Divij Vaidya Resolution: Fixed > Delete topic may lead to directory offline > ------------------------------------------ > > Key: KAFKA-15391 > URL: https://issues.apache.org/jira/browse/KAFKA-15391 > Project: Kafka > Issue Type: Bug > Components: core > Reporter: Divij Vaidya > Assignee: Haruki Okada > Priority: Major > Fix For: 3.6.0, 3.5.2 > > > This is an edge case where the entire log directory is marked offline when we > delete a topic. This symptoms of this scenario is characterised by the > following logs: > {noformat} > [2023-08-14 09:22:12,600] ERROR Uncaught exception in scheduled task > 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152) > org.apache.kafka.common.errors.KafkaStorageException: Error while flushing > log for test-0 in dir /tmp/kafka-15093588566723278510 with offset 221 > (exclusive) and recovery point 221 Caused by: > java.nio.file.NoSuchFileException: > /tmp/kafka-15093588566723278510/test-0{noformat} > The above log is followed by logs such as: > {noformat} > [2023-08-14 09:22:12,601] ERROR Uncaught exception in scheduled task > 'flush-log' > (org.apache.kafka.server.util.KafkaScheduler:152)org.apache.kafka.common.errors.KafkaStorageException: > The log dir /tmp/kafka-15093588566723278510 is already offline due to a > previous IO exception.{noformat} > The below sequence of events demonstrate the scenario where this bug manifests > 1. On the broker, partition lock is acquired and UnifiedLog.roll() is called > which schedules an async call for > flushUptoOffsetExclusive(). The roll may be called due to segment rotation > time or size. > 2. Admin client calls deleteTopic > 3. On the broker, LogManager.asyncDelete() is called which will call > UnifiedLog.renameDir() > 4. The directory for the partition is successfully renamed with a "delete" > suffix. > 5. The async task scheduled in step 1 (flushUptoOffsetExclusive) starts > executing. It tries to call localLog.flush() without acquiring a partition > lock. > 6. LocalLog calls Utils.flushDir() which fails with an IOException. > 7. On IOException, log directory is added to logDirFailureChannel > 8. Any new interaction with this logDir fails and a log line is printed such > as > "The log dir $logDir is already offline due to a previous IO exception" > > This is the reason DeleteTopicTest is flaky as well - > https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.tags=trunk&search.timeZoneId=Europe/Berlin&tests.container=kafka.admin.DeleteTopicTest&tests.test=testDeleteTopicWithCleaner() -- This message was sent by Atlassian Jira (v8.20.10#820010)