Hello,

We have installed Confluent Kafka 5.2.1 on a 3-node Kafka cluster with a 3-node ZooKeeper cluster.
We have a number of topics in the Kafka cluster, and processes that continuously write data into some of them. Alongside these we also have some test topics that are rarely used. For the past couple of months we have been seeing strange behavior on these test topics that ends with a Kafka node crashing: the Kafka scheduler throws an exception and brings the service down when it cannot find a log segment it is trying to delete. What I cannot understand is why, even when there is no data, the Kafka scheduler tries to delete a particular log segment from the topic's log directory. I am attaching the logged error details (the original topic names have been replaced with different names), and I have also pasted the error below.

[2019-12-02 21:25:59,269] ERROR Error while deleting segments for test-debug-0 in dir /tmp/kafka-logs (kafka.server.LogDirFailureChannel:76)
java.nio.file.NoSuchFileException: /tmp/kafka-logs/test-debug-0/00000000000000000000.log
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
	at java.nio.file.Files.move(Files.java:1395)
	at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:805)
	at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:224)
	at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:488)
	at kafka.log.Log.asyncDeleteSegment(Log.scala:1924)
	at kafka.log.Log.deleteSegment(Log.scala:1909)
	at kafka.log.Log.$anonfun$deleteSegments$3(Log.scala:1455)
	at kafka.log.Log.$anonfun$deleteSegments$3$adapted(Log.scala:1455)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at kafka.log.Log.$anonfun$deleteSegments$2(Log.scala:1455)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at kafka.log.Log.maybeHandleIOException(Log.scala:2013)
	at kafka.log.Log.deleteSegments(Log.scala:1446)
	at kafka.log.Log.deleteOldSegments(Log.scala:1441)
	at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1519)
	at kafka.log.Log.deleteOldSegments(Log.scala:1509)
	at kafka.log.LogManager.$anonfun$cleanupLogs$3(LogManager.scala:913)
	at kafka.log.LogManager.$anonfun$cleanupLogs$3$adapted(LogManager.scala:910)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at kafka.log.LogManager.cleanupLogs(LogManager.scala:910)
	at kafka.log.LogManager.$anonfun$startup$2(LogManager.scala:395)
	at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
	at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.nio.file.NoSuchFileException: /tmp/kafka-logs/test-debug-0/00000000000000000000.log -> /tmp/kafka-logs/test-debug-0/00000000000000000000.log.deleted
		at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
		at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
		at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
		at java.nio.file.Files.move(Files.java:1395)
		at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:802)
		... 30 more

[2019-12-02 21:25:59,271] ERROR Uncaught exception in scheduled task 'kafka-log-retention' (kafka.utils.KafkaScheduler:76)
org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for test-debug-0 in dir /tmp/kafka-logs
Caused by: java.nio.file.NoSuchFileException: /tmp/kafka-logs/test-debug-0/00000000000000000000.log
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:409)
	at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
	at java.nio.file.Files.move(Files.java:1395)
	at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:805)
	at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:224)
	at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:488)
	at kafka.log.Log.asyncDeleteSegment(Log.scala:1924)
	at kafka.log.Log.deleteSegment(Log.scala:1909)
	at kafka.log.Log.$anonfun$deleteSegments$3(Log.scala:1455)
	at kafka.log.Log.$anonfun$deleteSegments$3$adapted(Log.scala:1455)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at kafka.log.Log.$anonfun$deleteSegments$2(Log.scala:1455)
	at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
	at kafka.log.Log.maybeHandleIOException(Log.scala:2013)
	at kafka.log.Log.deleteSegments(Log.scala:1446)
	at kafka.log.Log.deleteOldSegments(Log.scala:1441)
	at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1519)
	at kafka.log.Log.deleteOldSegments(Log.scala:1509)
	at kafka.log.LogManager.$anonfun$cleanupLogs$3(LogManager.scala:913)
	at kafka.log.LogManager.$anonfun$cleanupLogs$3$adapted(LogManager.scala:910)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at kafka.log.LogManager.cleanupLogs(LogManager.scala:910)
	at kafka.log.LogManager.$anonfun$startup$2(LogManager.scala:395)
	at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
	at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.nio.file.NoSuchFileException: /tmp/kafka-logs/test-debug-0/00000000000000000000.log -> /tmp/kafka-logs/test-debug-0/00000000000000000000.log.deleted
		at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
		at sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:396)
		at sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:262)
		at java.nio.file.Files.move(Files.java:1395)
		at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:802)
		... 30 more

[2019-12-02 21:25:59,272] INFO [ReplicaManager broker=1] Stopping serving replicas in dir /tmp/kafka-logs (kafka.server.ReplicaManager:66)
[2019-12-02 21:25:59,287] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(test-debug-0, <other topics.....>) (kafka.server.ReplicaAlterLogDirsManager:66)
[2019-12-02 21:25:59,326] INFO [KafkaApi-1] Closing connection due to error during produce request with correlation id 81507 from client id with ack=0 Topic and partition to exceptions: topic_a-0 -> org.apache.kafka.common.errors.KafkaStorageException (kafka.server.KafkaApis:66)
[2019-12-02 21:25:59,409] INFO [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions test-debug-0, , <other topics.....> and stopped moving logs for partitions because they are in the failed log directory /tmp/kafka-logs. (kafka.server.ReplicaManager:66)
[2019-12-02 21:25:59,409] INFO Stopping serving logs in dir /tmp/kafka-logs (kafka.log.LogManager:66)
[2019-12-02 21:25:59,411] INFO [KafkaApi-1] Closing connection due to error during produce request with correlation id 57755 from client id with ack=0 Topic and partition to exceptions: topic_-2 -> org.apache.kafka.common.errors.KafkaStorageException (kafka.server.KafkaApis:66)
[2019-12-02 21:25:59,428] ERROR Shutdown broker because all log dirs in /tmp/kafka-logs have failed (kafka.log.LogManager:143)

Kindly guide me on how to overcome this error. Any help would be much appreciated.

Best,
Akhauri Prateek Shekhar
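P.S. For what it's worth, here is a minimal sketch (in Python, outside Kafka, with a hypothetical temp directory standing in for the log dir) of the rename step the stack trace shows failing. Per the Suppressed exception, the retention task first renames a segment to `<name>.deleted` (Log.asyncDeleteSegment -> LogSegment.changeFileSuffixes -> FileRecords.renameTo -> Files.move) before removing it, so if the segment file has already vanished from disk by some other means (for example, a scheduled cleanup of /tmp), that rename fails with exactly this "no such file" error:

```python
# Sketch of the rename-to-".deleted" step from the stack trace.
# Assumption: the segment file was removed behind the broker's back,
# so it no longer exists when the retention task tries to rename it.
import os
import tempfile

# Hypothetical stand-in for a partition directory like
# /tmp/kafka-logs/test-debug-0
log_dir = tempfile.mkdtemp(prefix="kafka-logs-sketch-")
segment = os.path.join(log_dir, "00000000000000000000.log")

# The segment file is deliberately never created, simulating a file
# deleted externally (e.g. by a tmp cleaner) before retention runs.
try:
    os.rename(segment, segment + ".deleted")
    outcome = "renamed"
except FileNotFoundError:
    outcome = "no such file"

print(outcome)  # -> no such file
```

This makes me suspect the segment files under /tmp/kafka-logs are being removed by something outside Kafka rather than by the broker itself, but I would appreciate confirmation.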