[ https://issues.apache.org/jira/browse/KAFKA-13544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464090#comment-17464090 ]
Guozhang Wang commented on KAFKA-13544: --------------------------------------- Hi @joecqupt thanks for pinging me on slack, I've added you to the contributor list too so that you can assign yourself to tickets in the future. Thanks for your contribution! > Deadlock during shutting down kafka broker because of connectivity problem > with zookeeper > ------------------------------------------------------------------------------------------ > > Key: KAFKA-13544 > URL: https://issues.apache.org/jira/browse/KAFKA-13544 > Project: Kafka > Issue Type: Bug > Components: core > Affects Versions: 2.8.1 > Reporter: Andrei Lakhmanets > Assignee: joecqupt > Priority: Major > Fix For: 3.1.0, 3.0.1, 2.8.2 > > Attachments: kafka_broker_logs.log, kafka_broker_stackdump.txt > > > Hi team, > *Kafka version:* 2.8.1 > *Configuration:* 3 kafka brokers in different availability zones and 3 > zookeeper brokers in different availability zones. > I faced with deadlock in kafka. I've attached stack dump of the kafka state > to this ticket. The locked threads are "feature-zk-node-event-process-thread" > and "kafka-shutdown-hook". > *Context:* > My kafka cluster had connectivity problems with zookeeper and in the logs I > saw the next exception: > The stacktrace: > {code:java} > [2021-12-06 18:31:14,629] WARN Unable to reconnect to ZooKeeper service, > session 0x10000039563000f has expired (org.apache.zookeeper.ClientCnxn) > [2021-12-06 18:31:14,629] INFO Unable to reconnect to ZooKeeper service, > session 0x10000039563000f has expired, closing socket connection > (org.apache.zookeeper.ClientCnxn) > [2021-12-06 18:31:14,629] INFO EventThread shut down for session: > 0x10000039563000f (org.apache.zookeeper.ClientCnxn) > [2021-12-06 18:31:14,631] INFO [ZooKeeperClient Kafka server] Session > expired. (kafka.zookeeper.ZooKeeperClient) > [2021-12-06 18:31:14,632] ERROR [feature-zk-node-event-process-thread]: > Failed to process feature ZK node change event. The broker will eventually > exit. > (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread) > kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either > before or while waiting for connection > at > kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:279) > at > kafka.zookeeper.ZooKeeperClient.$anonfun$waitUntilConnected$1(ZooKeeperClient.scala:261) > at > kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:261) > at > kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1797) > at > kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1767) > at > kafka.zk.KafkaZkClient.retryRequestUntilConnected(KafkaZkClient.scala:1762) > at kafka.zk.KafkaZkClient.getDataAndStat(KafkaZkClient.scala:771) > at kafka.zk.KafkaZkClient.getDataAndVersion(KafkaZkClient.scala:755) > at > kafka.server.FinalizedFeatureChangeListener$FeatureCacheUpdater.updateLatestOrThrow(FinalizedFeatureChangeListener.scala:74) > at > kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread.doWork(FinalizedFeatureChangeListener.scala:147) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96) {code} > The exception is thrown in feature-zk-node-event-process-thread thread and it > is catched in method > FinalizedFeatureChangeListener.ChangeNotificationProcessorThread.doWork and > then doWork method throws FatalExitError(1). > The FatalExitError catched in ShutdownableThread.run method and call > Exit.exit(e.statusCode()) which calls System.exit under the hood. > The stackdump of "feature-zk-node-event-process-thread" thread: > {code:java} > "feature-zk-node-event-process-thread" #23 prio=5 os_prio=0 cpu=163.19ms > elapsed=1563046.32s tid=0x00007fd0dcdec800 nid=0x2088 in Object.wait() > [0x00007fd07e2c1000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(java.base@11.0.11/Native Method) > - waiting on <no object reference available> > at java.lang.Thread.join(java.base@11.0.11/Thread.java:1300) > - waiting to re-lock in wait() <0x0000000088b9d3c8> (a > org.apache.kafka.common.utils.KafkaThread) > at java.lang.Thread.join(java.base@11.0.11/Thread.java:1375) > at > java.lang.ApplicationShutdownHooks.runHooks(java.base@11.0.11/ApplicationShutdownHooks.java:107) > at > java.lang.ApplicationShutdownHooks$1.run(java.base@11.0.11/ApplicationShutdownHooks.java:46) > at java.lang.Shutdown.runHooks(java.base@11.0.11/Shutdown.java:130) > at java.lang.Shutdown.exit(java.base@11.0.11/Shutdown.java:174) > - locked <0x00000000806872f8> (a java.lang.Class for java.lang.Shutdown) > at java.lang.Runtime.exit(java.base@11.0.11/Runtime.java:116) > at java.lang.System.exit(java.base@11.0.11/System.java:1752) > at org.apache.kafka.common.utils.Exit$2.execute(Exit.java:43) > at org.apache.kafka.common.utils.Exit.exit(Exit.java:66) > at kafka.utils.Exit$.exit(Exit.scala:28) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:102) {code} > System.exit method before shutting down virtual machine calls shutdown hooks > and spawns the thread "kafka-shutdown-hook". > The thread "kafka-shutdown-hook" calls kafka.server.KafkaServer.shutdown > which during work calls > {code:java} > if (featureChangeListener != null) > CoreUtils.swallow(featureChangeListener.close(), this) {code} > Where featureChangeListener.close is > {code:java} > def close(): Unit = { > zkClient.unregisterStateChangeHandler(ZkStateChangeHandler.name) > zkClient.unregisterZNodeChangeHandler(FeatureZNodeChangeHandler.path) > queue.clear() > thread.shutdown() > thread.join() > } {code} > It enteres in the deadlock on the line "thread.join()" because it waits that > the thread "feature-zk-node-event-process-thread" should die but this thread > spawned the thead "kafka-shutdown-hook" and can't die until the thread > "kafka-shutdown-hook" is finished. > The stack dump of the thread "kafka-shutdown-hook": > {code:java} > "kafka-shutdown-hook" #18 prio=5 os_prio=0 cpu=134.10ms elapsed=50906.28s > tid=0x00007fd0ac15c800 nid=0x6d3bb in Object.wait() [0x00007fd02dcba000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(java.base@11.0.11/Native Method) > - waiting on <no object reference available> > at java.lang.Thread.join(java.base@11.0.11/Thread.java:1300) > - waiting to re-lock in wait() <0x0000000088973e98> (a > kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread) > at java.lang.Thread.join(java.base@11.0.11/Thread.java:1375) > at > kafka.server.FinalizedFeatureChangeListener.close(FinalizedFeatureChangeListener.scala:245) > at kafka.server.KafkaServer.$anonfun$shutdown$23(KafkaServer.scala:717) > at > kafka.server.KafkaServer$$Lambda$1681/0x000000084082b040.apply$mcV$sp(Unknown > Source) > at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:68) > at kafka.server.KafkaServer.shutdown(KafkaServer.scala:717) > at kafka.Kafka$.$anonfun$main$3(Kafka.scala:100) > at kafka.Kafka$$$Lambda$205/0x0000000840233840.apply$mcV$sp(Unknown > Source) > at kafka.utils.Exit$.$anonfun$addShutdownHook$1(Exit.scala:38) > at kafka.Kafka$$$Lambda$206/0x0000000840233c40.run(Unknown Source) > at java.lang.Thread.run(java.base@11.0.11/Thread.java:829){code} -- This message was sent by Atlassian Jira (v8.20.1#820001)