[ https://issues.apache.org/jira/browse/KAFKA-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
José Armando García Sancio reassigned KAFKA-14693:
--------------------------------------------------

    Assignee: José Armando García Sancio

> KRaft Controller and ProcessExitingFaultHandler can deadlock shutdown
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-14693
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14693
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 3.4.0
>            Reporter: José Armando García Sancio
>            Assignee: José Armando García Sancio
>            Priority: Critical
>             Fix For: 3.4.1
>
>
> h1. Problem
> When the KRaft controller encounters an error that it cannot handle, it calls
> {{ProcessExitingFaultHandler}}, which calls {{Exit.exit}}, which in turn calls
> {{Runtime.exit}}.
> According to the {{Runtime.exit}} documentation:
> {quote}All registered [shutdown
> hooks|https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#addShutdownHook-java.lang.Thread-],
> if any, are started in some unspecified order and allowed to run
> concurrently until they finish. Once this is done the virtual machine
> [halts|https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#halt-int-].
> {quote}
> One of the shutdown hooks registered by Kafka is {{Server.shutdown()}}.
> This shutdown hook eventually calls {{KafkaEventQueue.close}}, which joins
> on the controller thread. Unfortunately, the controller thread is itself
> blocked inside {{Runtime.exit}}, joining on the shutdown hook thread, so the
> two threads wait on each other and neither ever finishes.
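
The mutual join described above can be reproduced in miniature without Kafka or {{Runtime.exit}}. The following is a minimal sketch, not Kafka code: the class name {{MutualJoinDemo}}, the method {{bothStuck}}, and the thread names are all hypothetical, and two plain threads stand in for the controller thread and the shutdown hook thread. It only checks that both threads are still blocked after a timeout, then interrupts them so the demo itself can exit.

```java
import java.util.concurrent.CountDownLatch;

public class MutualJoinDemo {

    /**
     * Starts two threads that join on each other and reports whether both
     * are still blocked after waitMillis. Hypothetical helper for illustration.
     */
    static boolean bothStuck(long waitMillis) {
        if (waitMillis <= 0) {
            // Thread.join(0) would wait forever, so insist on a real timeout.
            throw new IllegalArgumentException("waitMillis must be positive");
        }
        Thread[] t = new Thread[2];
        CountDownLatch controllerRunning = new CountDownLatch(1);

        // Stand-in for the shutdown hook: closing the event queue joins
        // on the controller thread.
        t[0] = new Thread(() -> {
            try {
                controllerRunning.await();
                t[1].join();
            } catch (InterruptedException e) {
                // Interrupted by the demo during cleanup.
            }
        }, "demo-shutdown-hook");

        // Stand-in for the controller thread: the exit path joins on
        // every shutdown hook thread.
        t[1] = new Thread(() -> {
            controllerRunning.countDown();
            try {
                t[0].join();
            } catch (InterruptedException e) {
                // Interrupted by the demo during cleanup.
            }
        }, "demo-controller");

        t[0].setDaemon(true);
        t[1].setDaemon(true);
        t[0].start();
        t[1].start();

        try {
            t[1].join(waitMillis);
            boolean stuck = t[0].isAlive() && t[1].isAlive();
            // Break the cycle so the demo can terminate.
            t[0].interrupt();
            t[1].interrupt();
            t[0].join();
            t[1].join();
            return stuck;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("both threads stuck: " + bothStuck(200));
    }
}
```

In the real broker nothing interrupts either thread, which is why the process never exits.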
> Here are sample thread stacks:
> {code:java}
> "QuorumControllerEventHandler" #45 prio=5 os_prio=0 cpu=429352.87ms elapsed=620807.49s allocated=38544M defined_classes=353 tid=0x00007f5aeb31f800 nid=0x80c in Object.wait() [0x00007f5a658fb000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(java.base@17.0.5/Native Method)
>         - waiting on <no object reference available>
>         at java.lang.Thread.join(java.base@17.0.5/Thread.java:1304)
>         - locked <0x00000000a29241f8> (a org.apache.kafka.common.utils.KafkaThread)
>         at java.lang.Thread.join(java.base@17.0.5/Thread.java:1372)
>         at java.lang.ApplicationShutdownHooks.runHooks(java.base@17.0.5/ApplicationShutdownHooks.java:107)
>         at java.lang.ApplicationShutdownHooks$1.run(java.base@17.0.5/ApplicationShutdownHooks.java:46)
>         at java.lang.Shutdown.runHooks(java.base@17.0.5/Shutdown.java:130)
>         at java.lang.Shutdown.exit(java.base@17.0.5/Shutdown.java:173)
>         - locked <0x00000000ffe020b8> (a java.lang.Class for java.lang.Shutdown)
>         at java.lang.Runtime.exit(java.base@17.0.5/Runtime.java:115)
>         at java.lang.System.exit(java.base@17.0.5/System.java:1860)
>         at org.apache.kafka.common.utils.Exit$2.execute(Exit.java:43)
>         at org.apache.kafka.common.utils.Exit.exit(Exit.java:66)
>         at org.apache.kafka.common.utils.Exit.exit(Exit.java:62)
>         at org.apache.kafka.server.fault.ProcessExitingFaultHandler.handleFault(ProcessExitingFaultHandler.java:54)
>         at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:891)
>         at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:874)
>         at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:969){code}
> and
> {code:java}
> "kafka-shutdown-hook" #35 prio=5 os_prio=0 cpu=43.42ms elapsed=378593.04s allocated=4732K defined_classes=74 tid=0x00007f5a7c09d800 nid=0x4f37 in Object.wait() [0x00007f5a47afd000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(java.base@17.0.5/Native Method)
>         - waiting on <no object reference available>
>         at java.lang.Thread.join(java.base@17.0.5/Thread.java:1304)
>         - locked <0x00000000a272bcb0> (a org.apache.kafka.common.utils.KafkaThread)
>         at java.lang.Thread.join(java.base@17.0.5/Thread.java:1372)
>         at org.apache.kafka.queue.KafkaEventQueue.close(KafkaEventQueue.java:509)
>         at org.apache.kafka.controller.QuorumController.close(QuorumController.java:2553)
>         at kafka.server.ControllerServer.shutdown(ControllerServer.scala:521)
>         at kafka.server.KafkaRaftServer.shutdown(KafkaRaftServer.scala:184)
>         at kafka.Kafka$.$anonfun$main$3(Kafka.scala:99)
>         at kafka.Kafka$$$Lambda$406/0x0000000800fb9730.apply$mcV$sp(Unknown Source)
>         at kafka.utils.Exit$.$anonfun$addShutdownHook$1(Exit.scala:38)
>         at kafka.Kafka$$$Lambda$407/0x0000000800fb9a10.run(Unknown Source)
>         at java.lang.Thread.run(java.base@17.0.5/Thread.java:833)
>         at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:64) {code}
> h1. Possible Solution
> A possible solution is to have the controller's unhandled fault handler call
> {{Runtime.halt}} instead of {{Runtime.exit}}. {{Runtime.halt}} terminates
> the JVM without running shutdown hooks, so the controller thread never
> waits on the hook thread.
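
The proposed change can be sketched as follows. This is an illustrative sketch only, not the actual Kafka patch: {{HaltingFaultHandlerSketch}}, {{HaltingFaultHandler}}, and its {{handleFault}} signature are hypothetical and do not reproduce Kafka's real fault handler interface. The only real API relied on is {{Runtime.getRuntime().halt(int)}}. The terminating action is injected as an {{IntConsumer}} (an assumption made purely so the handler can be exercised without killing the JVM).

```java
import java.util.function.IntConsumer;

public class HaltingFaultHandlerSketch {

    /** Hypothetical fault handler that halts instead of exiting. */
    static final class HaltingFaultHandler {
        private final IntConsumer terminator;

        /** Production-style default: halt the JVM, skipping shutdown hooks. */
        HaltingFaultHandler() {
            this(status -> Runtime.getRuntime().halt(status));
        }

        /** Lets callers substitute the terminating action, e.g. in tests. */
        HaltingFaultHandler(IntConsumer terminator) {
            this.terminator = terminator;
        }

        RuntimeException handleFault(String message, Throwable cause) {
            System.err.println("Encountered fatal fault: " + message);
            // With the default terminator this call never returns.
            terminator.accept(1);
            // Only reached when a non-terminating action was injected.
            return new RuntimeException(message, cause);
        }
    }

    public static void main(String[] args) {
        int[] seen = {-1};
        // Record the status instead of halting so the demo can print it.
        HaltingFaultHandler handler = new HaltingFaultHandler(status -> seen[0] = status);
        handler.handleFault("simulated fault", new IllegalStateException("boom"));
        System.out.println("terminator called with status " + seen[0]);
    }
}
```

The trade-off is that halting skips every shutdown hook, so none of the cleanup normally performed during a graceful shutdown runs on this path.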