Divij Vaidya created KAFKA-15391:
------------------------------------

             Summary: Delete topic may lead to directory offline
                 Key: KAFKA-15391
                 URL: https://issues.apache.org/jira/browse/KAFKA-15391
             Project: Kafka
          Issue Type: Bug
          Components: core
            Reporter: Divij Vaidya
             Fix For: 3.6.0


This is an edge case where the entire log directory is marked offline when we 
delete a topic. The symptoms of this scenario are characterised by the 
following logs:


{noformat}
[2023-08-14 09:22:12,600] ERROR Uncaught exception in scheduled task 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152)
org.apache.kafka.common.errors.KafkaStorageException: Error while flushing log for test-0 in dir /tmp/kafka-15093588566723278510 with offset 221 (exclusive) and recovery point 221
Caused by: java.nio.file.NoSuchFileException: /tmp/kafka-15093588566723278510/test-0
{noformat}
The above log is followed by logs such as:
{noformat}
[2023-08-14 09:22:12,601] ERROR Uncaught exception in scheduled task 'flush-log' (org.apache.kafka.server.util.KafkaScheduler:152)
org.apache.kafka.common.errors.KafkaStorageException: The log dir /tmp/kafka-15093588566723278510 is already offline due to a previous IO exception.
{noformat}

The following sequence of events demonstrates the scenario in which this bug 
manifests:
1. On the broker, the partition lock is acquired and UnifiedLog.roll() is 
called, which schedules an async call to flushUptoOffsetExclusive(). The roll 
may be triggered by the segment rotation time or size.
2. The admin client calls deleteTopic.
3. On the broker, LogManager.asyncDelete() is called, which calls 
UnifiedLog.renameDir().
4. The directory for the partition is successfully renamed with a "delete" 
suffix.
5. The async task scheduled in step 1 (flushUptoOffsetExclusive) starts 
executing. It tries to call localLog.flush() without acquiring the partition 
lock.
6. LocalLog calls Utils.flushDir(), which fails with an IOException because the 
original partition directory no longer exists.
7. On the IOException, the log directory is added to logDirFailureChannel.
8. Any new interaction with this logDir fails, and a log line is printed such as 
"The log dir $logDir is already offline due to a previous IO exception".
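
The ordering in steps 4 to 8 can be sketched with plain JDK file APIs. This is a minimal stand-alone model, not Kafka code: the class DeleteFlushRaceSketch, the offlineDirs set, the flushTask helper and the simplified "test-0-delete" directory name are all hypothetical, and flushDir here only approximates a directory fsync by opening the directory and forcing it. The race is replayed sequentially so that the outcome is deterministic.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the delete/flush ordering described above; not Kafka code.
class DeleteFlushRaceSketch {
    // Stand-in for the broker's record of offline log directories (assumption).
    static final Set<Path> offlineDirs = new HashSet<>();

    // Approximates a directory fsync: open the directory read-only and force it.
    static void flushDir(Path dir) throws IOException {
        try (FileChannel channel = FileChannel.open(dir, StandardOpenOption.READ)) {
            channel.force(true);
        }
    }

    // Models steps 5 to 8: flush a partition directory without re-checking that it still exists.
    static String flushTask(Path logDir, Path partitionDir) {
        if (offlineDirs.contains(logDir)) {
            return "already offline";
        }
        try {
            flushDir(partitionDir);
            return "flushed";
        } catch (IOException e) {
            // Any IOException takes the whole log directory offline, not just the partition.
            offlineDirs.add(logDir);
            return "failed: " + e.getClass().getSimpleName();
        }
    }

    static String[] reproduce() throws IOException {
        Path logDir = Files.createTempDirectory("kafka-sketch");
        Path partitionDir = Files.createDirectory(logDir.resolve("test-0"));

        // Step 1: the scheduled flush holds on to the pre-rename path.
        Path stalePath = partitionDir;

        // Steps 2 to 4: topic deletion renames the partition directory.
        Path renamed = Files.move(partitionDir, logDir.resolve("test-0-delete"));

        // Steps 5 to 7: the flush runs against the stale path and fails.
        String first = flushTask(logDir, stalePath);
        // Step 8: later work on the same log directory is rejected.
        String second = flushTask(logDir, stalePath);

        // Clean up the temporary directories.
        Files.delete(renamed);
        Files.delete(logDir);
        return new String[] {first, second};
    }

    public static void main(String[] args) throws IOException {
        for (String outcome : reproduce()) {
            System.out.println(outcome);
        }
    }
}
```

In this model the first flush attempt fails with a NoSuchFileException and the second is rejected because the directory is already recorded as offline, which mirrors the two log excerpts above. The failure happens only because the stale path is used after the rename; the sketch does not attempt to show how the broker should guard against it.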
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
