Jerome Morel created KAFKA-16997:
------------------------------------

             Summary: do not stop kafka when issue to delete a partition folder
                 Key: KAFKA-16997
                 URL: https://issues.apache.org/jira/browse/KAFKA-16997
             Project: Kafka
          Issue Type: Improvement
          Components: core
    Affects Versions: 3.6.2
            Reporter: Jerome Morel


Context: In our project we create many partitions. Even when we delete their 
segments, the partitions themselves remain, and we ended up with so many that 
Kafka crashes because of the number of open files. We therefore want to delete 
those partitions regularly, but Kafka stops while doing so.

 

The issue: after some investigation we found that the deletion process 
sometimes only logs a warning when it cannot delete some log files:
{code:java}
[2024-06-17 15:52:39,590] WARN Failed atomic move of /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex to /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex.deleted retrying with a non-atomic move (org.apache.kafka.common.utils.Utils)
java.nio.file.NoSuchFileException: /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex -> /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex.deleted
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:416)
        at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:266)
        at java.base/java.nio.file.Files.move(Files.java:1432)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:980)
        at org.apache.kafka.storage.internals.log.LazyIndex$IndexFile.renameTo(LazyIndex.java:80)
        at org.apache.kafka.storage.internals.log.LazyIndex.renameTo(LazyIndex.java:202)
        at org.apache.kafka.storage.internals.log.LogSegment.changeFileSuffixes(LogSegment.java:666)
        at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$1(LocalLog.scala:912)
        at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$1$adapted(LocalLog.scala:910)
        at scala.collection.immutable.List.foreach(List.scala:431)
        at kafka.log.LocalLog$.deleteSegmentFiles(LocalLog.scala:910)
        at kafka.log.LocalLog.removeAndDeleteSegments(LocalLog.scala:289)
{code}
After that it simply continues. But when it cannot delete a folder, it marks 
the replica as failed and then stops Kafka if that is the only replica 
available (which is our case):
{code:java}
[2024-06-17 15:52:39,637] ERROR Error while deleting dir for 69747657-f49d-453f-9fa2-4d4369199699-0 in dir /tmp/kafka-logs-mnt/kafka-no-docker (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
java.nio.file.DirectoryNotEmptyException: /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete
        at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:246)
        at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
        at java.base/java.nio.file.Files.delete(Files.java:1152)
        at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:923)
        at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:901)
        at java.base/java.nio.file.Files.walkFileTree(Files.java:2828)
        at java.base/java.nio.file.Files.walkFileTree(Files.java:2882)
        at org.apache.kafka.common.utils.Utils.delete(Utils.java:901)
        at kafka.log.LocalLog.$anonfun$deleteEmptyDir$2(LocalLog.scala:243)
        at kafka.log.LocalLog.deleteEmptyDir(LocalLog.scala:709)
        at kafka.log.UnifiedLog.$anonfun$delete$2(UnifiedLog.scala:1734)
        at kafka.log.UnifiedLog.delete(UnifiedLog.scala:1911)
        at kafka.log.LogManager.deleteLogs(LogManager.scala:1152)
        at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:1166)
        at org.apache.kafka.server.util.KafkaScheduler.lambda$schedule$1(KafkaScheduler.java:150)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
[2024-06-17 15:52:39,640] WARN [ReplicaManager broker=0] Stopping serving replicas in dir /tmp/kafka-logs-mnt/kafka-no-docker (kafka.server.ReplicaManager)
[2024-06-17 15:52:39,640] INFO [LocalLog partition=a11f3352-56fc-4d00-bdf8-f5fee33391f6-0, dir=/tmp/kafka-logs-mnt/kafka-no-docker] Deleting segment files LogSegment(baseOffset=0, size=861, lastModifiedTime=0, largestRecordTimestamp=1718632120826) (kafka.log.LocalLog$)
[2024-06-17 15:52:39,641] ERROR Uncaught exception in scheduled task 'delete-file' (org.apache.kafka.server.util.KafkaScheduler)
org.apache.kafka.common.errors.KafkaStorageException: The log dir /tmp/kafka-logs-mnt/kafka-no-docker is already offline due to a previous IO exception.
[2024-06-17 15:52:39,641] ERROR Exception while deleting Log(dir=/tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete, topicId=wohaEWpfTR6HuqDFlcIJYw, topic=69747657-f49d-453f-9fa2-4d4369199699, partition=0, highWatermark=10, lastStableOffset=10, logStartOffset=10, logEndOffset=10) in dir /tmp/kafka-logs-mnt/kafka-no-docker. (kafka.log.LogManager)
org.apache.kafka.common.errors.KafkaStorageException: Error while deleting dir for 69747657-f49d-453f-9fa2-4d4369199699-0 in dir /tmp/kafka-logs-mnt/kafka-no-docker
Caused by: java.nio.file.DirectoryNotEmptyException: /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete
        at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:246)
        at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
        at java.base/java.nio.file.Files.delete(Files.java:1152)
        at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:923)
        at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:901)
        at java.base/java.nio.file.Files.walkFileTree(Files.java:2828)
        at java.base/java.nio.file.Files.walkFileTree(Files.java:2882)
        at org.apache.kafka.common.utils.Utils.delete(Utils.java:901)
        at kafka.log.LocalLog.$anonfun$deleteEmptyDir$2(LocalLog.scala:243)
        at kafka.log.LocalLog.deleteEmptyDir(LocalLog.scala:709)
        at kafka.log.UnifiedLog.$anonfun$delete$2(UnifiedLog.scala:1734)
        at kafka.log.UnifiedLog.delete(UnifiedLog.scala:1911)
        at kafka.log.LogManager.deleteLogs(LogManager.scala:1152)
        at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:1166)
        at org.apache.kafka.server.util.KafkaScheduler.lambda$schedule$1(KafkaScheduler.java:150)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
[2024-06-17 15:52:39,642] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set
{code}
We tried different versions of Kafka (2.8 and 3.7) and the behaviour is the same.

Is there a reason to only log a warning when a file in the partition cannot be 
deleted, but to fail hard when it is the directory itself that cannot be 
deleted? Would it be possible to also just log a warning when the directory 
cannot be deleted and carry on?
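
To make the requested behaviour concrete, here is a minimal sketch of a recursive delete that logs a warning for every path it cannot remove and keeps going, instead of failing on the first error. It is not Kafka's code: the class `TolerantDelete` and the method `deleteCollectingFailures` are hypothetical names chosen for illustration, and only `java.nio.file` from the JDK is used.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Kafka.
final class TolerantDelete {

    /**
     * Recursively deletes {@code root}. Paths that cannot be deleted are
     * logged as warnings and returned; an empty list means everything is gone.
     */
    static List<Path> deleteCollectingFailures(Path root) throws IOException {
        List<Path> failed = new ArrayList<>();
        if (Files.notExists(root)) {
            return failed; // nothing to do
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                tryDelete(file, failed);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // The entry could not even be read; record it and move on.
                System.err.println("WARN could not visit " + file + ": " + exc);
                failed.add(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) {
                // A DirectoryNotEmptyException lands here as a warning, not a crash.
                tryDelete(dir, failed);
                return FileVisitResult.CONTINUE;
            }
        });
        return failed;
    }

    private static void tryDelete(Path path, List<Path> failed) {
        try {
            Files.deleteIfExists(path);
        } catch (IOException e) {
            System.err.println("WARN could not delete " + path + ": " + e);
            failed.add(path);
        }
    }
}
```

A caller could treat a non-empty result as "try again on the next cleanup run" rather than as a fatal storage error. Whether that is safe inside the broker is exactly the question raised above.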

In our case, after a restart of Kafka everything gets deleted as expected (the 
cause is a disk glitch).

Remark: our server has no local storage, so we use a network disk, and such 
glitches may happen often.
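
Since the deletion succeeds once the glitch has passed, one possible mitigation is to retry the failing I/O step a few times with a growing pause before giving up. The sketch below only illustrates that idea under stated assumptions: `RetryingDelete`, `IoAction` and `retry` are hypothetical names, the attempt count and backoff values are arbitrary, and nothing here reflects how Kafka schedules its own deletions.

```java
import java.io.IOException;

// Hypothetical helper for illustration only; not part of Kafka.
final class RetryingDelete {

    /** An I/O step that may fail transiently, e.g. on a network disk. */
    @FunctionalInterface
    interface IoAction {
        void run() throws IOException;
    }

    /**
     * Runs {@code action} up to {@code maxAttempts} times, doubling the pause
     * after each failure. Returns true on success, false if every attempt
     * failed or the thread was interrupted while waiting.
     */
    static boolean retry(IoAction action, int maxAttempts, long initialBackoffMs) {
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return true;
            } catch (IOException e) {
                System.err.println("WARN attempt " + attempt + "/" + maxAttempts
                        + " failed: " + e);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt
                    return false;
                }
                backoffMs *= 2;
            }
        }
        return false;
    }
}
```

For example, `RetryingDelete.retry(() -> java.nio.file.Files.delete(dir), 5, 200)` would attempt to delete the hypothetical path `dir` up to five times, waiting 200 ms, 400 ms, 800 ms and 1600 ms between attempts.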



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
