[jira] [Comment Edited] (KAFKA-13403) KafkaServer crashes when deleting topics due to the race in log deletion

2022-04-26 Thread Haruki Okada (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527800#comment-17527800
 ] 

Haruki Okada edited comment on KAFKA-13403 at 4/26/22 7:35 AM:
---

[~showuon] Hi, could you help reviewing the PR 
[https://github.com/apache/kafka/pull/11438] ?

 

-There seems to be another ticket likely due to the same cause: 
https://issues.apache.org/jira/browse/KAFKA-13855-


After took another look at 13855, seems currently there's no clue to conclude 
it is the same cause.


was (Author: ocadaruma):
[~showuon] Hi, could you help reviewing the PR 
[https://github.com/apache/kafka/pull/11438] ?

 

There seems to be another ticket likely due to the same cause: 
https://issues.apache.org/jira/browse/KAFKA-13855

> KafkaServer crashes when deleting topics due to the race in log deletion
> 
>
> Key: KAFKA-13403
> URL: https://issues.apache.org/jira/browse/KAFKA-13403
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.4.1
>Reporter: Haruki Okada
>Assignee: Haruki Okada
>Priority: Major
>
> h2. Environment
>  * OS: CentOS Linux release 7.6
>  * Kafka version: 2.4.1
>  * 
>  ** But as far as I checked the code, I think same phenomenon could happen 
> even on trunk
>  * Kafka log directory: RAID1+0 (i.e. not using JBOD so only single log.dirs 
> is set)
>  * Java version: AdoptOpenJDK 1.8.0_282
> h2. Phenomenon
> When we were in the middle of deleting several topics by `kafka-topics.sh 
> --delete --topic blah-blah`, one broker in our cluster crashed due to 
> following exception:
>  
> {code:java}
> [2021-10-21 18:19:19,122] ERROR Shutdown broker because all log dirs in 
> /data/kafka have failed (kafka.log.LogManager)
> {code}
>  
>  
> We also found NoSuchFileException was thrown right before the crash when 
> LogManager tried to delete logs for some partitions.
>  
> {code:java}
> [2021-10-21 18:19:18,849] ERROR Error while deleting log for foo-bar-topic-5 
> in dir /data/kafka (kafka.server.LogDirFailureChannel)
> java.nio.file.NoSuchFileException: 
> /data/kafka/foo-bar-topic-5.df3626d2d9eb41a2aeb0b8d55d7942bd-delete/03877066.timeindex.deleted
> at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at 
> sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> at 
> sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
> at 
> sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> at java.nio.file.Files.readAttributes(Files.java:1737)
> at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
> at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
> at java.nio.file.FileTreeWalker.next(FileTreeWalker.java:372)
> at java.nio.file.Files.walkFileTree(Files.java:2706)
> at java.nio.file.Files.walkFileTree(Files.java:2742)
> at org.apache.kafka.common.utils.Utils.delete(Utils.java:732)
> at kafka.log.Log.$anonfun$delete$2(Log.scala:2036)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at kafka.log.Log.maybeHandleIOException(Log.scala:2343)
> at kafka.log.Log.delete(Log.scala:2030)
> at kafka.log.LogManager.deleteLogs(LogManager.scala:826)
> at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:840)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:116)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:65)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> So, the log-dir was marked as offline and ended up with KafkaServer crash 
> because the broker has only single log-dir.
> h2. Cause
> We also found below logs right before the NoSuchFileException.
>  
> {code:java}
> [2021-10-21 18:18:17,829] INFO Log for partition foo-bar-5 is renamed to 
> 

[jira] [Comment Edited] (KAFKA-13403) KafkaServer crashes when deleting topics due to the race in log deletion

2021-10-26 Thread Haruki Okada (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434630#comment-17434630
 ] 

Haruki Okada edited comment on KAFKA-13403 at 10/27/21, 1:49 AM:
-

[~showuon] Already working on it? or may I (or my colleague).


was (Author: ocadaruma):
[~showuon] Already working on it? or may I.

> KafkaServer crashes when deleting topics due to the race in log deletion
> 
>
> Key: KAFKA-13403
> URL: https://issues.apache.org/jira/browse/KAFKA-13403
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.4.1
>Reporter: Haruki Okada
>Assignee: Luke Chen
>Priority: Major
>
> h2. Environment
>  * OS: CentOS Linux release 7.6
>  * Kafka version: 2.4.1
>  * 
>  ** But as far as I checked the code, I think same phenomenon could happen 
> even on trunk
>  * Kafka log directory: RAID1+0 (i.e. not using JBOD so only single log.dirs 
> is set)
>  * Java version: AdoptOpenJDK 1.8.0_282
> h2. Phenomenon
> When we were in the middle of deleting several topics by `kafka-topics.sh 
> --delete --topic blah-blah`, one broker in our cluster crashed due to 
> following exception:
>  
> {code:java}
> [2021-10-21 18:19:19,122] ERROR Shutdown broker because all log dirs in 
> /data/kafka have failed (kafka.log.LogManager)
> {code}
>  
>  
> We also found NoSuchFileException was thrown right before the crash when 
> LogManager tried to delete logs for some partitions.
>  
> {code:java}
> [2021-10-21 18:19:18,849] ERROR Error while deleting log for foo-bar-topic-5 
> in dir /data/kafka (kafka.server.LogDirFailureChannel)
> java.nio.file.NoSuchFileException: 
> /data/kafka/foo-bar-topic-5.df3626d2d9eb41a2aeb0b8d55d7942bd-delete/03877066.timeindex.deleted
> at 
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at 
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at 
> sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
> at 
> sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
> at 
> sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
> at java.nio.file.Files.readAttributes(Files.java:1737)
> at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
> at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
> at java.nio.file.FileTreeWalker.next(FileTreeWalker.java:372)
> at java.nio.file.Files.walkFileTree(Files.java:2706)
> at java.nio.file.Files.walkFileTree(Files.java:2742)
> at org.apache.kafka.common.utils.Utils.delete(Utils.java:732)
> at kafka.log.Log.$anonfun$delete$2(Log.scala:2036)
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at kafka.log.Log.maybeHandleIOException(Log.scala:2343)
> at kafka.log.Log.delete(Log.scala:2030)
> at kafka.log.LogManager.deleteLogs(LogManager.scala:826)
> at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:840)
> at 
> kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:116)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:65)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> So, the log-dir was marked as offline and ended up with KafkaServer crash 
> because the broker has only single log-dir.
> h2. Cause
> We also found below logs right before the NoSuchFileException.
>  
> {code:java}
> [2021-10-21 18:18:17,829] INFO Log for partition foo-bar-5 is renamed to 
> /data/kafka/foo-bar-5.df3626d2d9eb41a2aeb0b8d55d7942bd-delete and is 
> scheduled for deletion (kafka.log.LogManager)
> [2021-10-21 18:18:17,900] INFO [Log partition=foo-bar-5, dir=/data/kafka] 
> Found deletable segments with base offsets [3877066] due to retention time 
> 17280ms breach (kafka.log.Log)[2021-10-21 18:18:17,901] INFO [Log 
> partition=foo-bar-5, dir=/data/kafka] Scheduling segments for deletion 
>