wangzhigang1999 opened a new pull request, #7150: URL: https://github.com/apache/kyuubi/pull/7150
- Introduced a new configuration property `kyuubi.session.engine.shutdown.watchdog.timeout` to manage the maximum wait time for engine shutdown. - Updated the `SparkSQLEngine` to utilize the new watchdog feature. - Minor adjustments to existing configurations and documentation to reflect the new feature. ### Why are the changes needed? Currently, there are scenarios where the engine should exit but fails to do so due to various reasons, and these scenarios cannot be exhaustively enumerated. For example, see this discussion: https://github.com/apache/kyuubi/discussions/6992#discussioncomment-13775648, and these issues: https://github.com/apache/kyuubi/issues/4280, https://github.com/apache/kyuubi/issues/7019. Similarly, we encountered this issue in production. For example, in the following log, after SparkContext stopped, the entire process should have executed the shutdown hook and exited. However, due to an abnormal Ranger thread, the process was blocked for over ten days until it eventually exhausted the ECS resources and was finally discovered. <img width="2844" height="1112" alt="image" src="https://github.com/user-attachments/assets/f09be191-d6f6-43dc-a462-c1d81a4f6758" /> ### How was this patch tested? For the ThreadDumpUtils utility class, I added unit tests; for the overall process, I conducted an end-to-end test: 1. Add a thread that never exits to the source code. <img width="2766" height="1996" alt="image" src="https://github.com/user-attachments/assets/34e2e02a-88ce-4afa-a72c-4a9886c1072f" /> 2. Keeping our configuration unchanged, first start an engine, then use kyuubi-ctl in the command line to delete it. Below is the complete log, showing that after one minute, the engine was successfully terminated. ``` 25/07/21 20:20:55 WARN [main-EventThread] ZookeeperDiscoveryClient: This Kyuubi instance master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:32377 is now de-registered from ZooKeeper. The server will be shut down after the last client session completes. 25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Shutdown Watchdog activated. Engine will be forcefully terminated if graceful shutdown exceeds 1 minute(s). 25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkTBinaryFrontend] is stopping. 25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service: [EngineServiceDiscovery] is stopping. 25/07/21 20:20:55 INFO [Curator-Framework-0] CuratorFrameworkImpl: backgroundOperationsLoop exiting 25/07/21 20:20:55 INFO [main-EventThread] ZooKeeper: Session: 0x10000011bf90006 closed 25/07/21 20:20:55 INFO [main-EventThread] EngineServiceDiscovery: Service[EngineServiceDiscovery] is stopped. 25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is stopped. 25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: SparkTBinaryFrontend has stopped 25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkSQLBackendService] is stopping. 25/07/21 20:20:55 INFO [main-EventThread] SparkSQLBackendService: Service: [SparkSQLSessionManager] is stopping. 25/07/21 20:20:55 INFO [main-EventThread] SparkSQLSessionManager: Service: [SparkSQLOperationManager] is stopping. 25/07/21 20:20:55 INFO [main-EventThread] SparkSQLOperationManager: Service[SparkSQLOperationManager] is stopped. 25/07/21 20:20:55 INFO [main-EventThread] SparkSQLSessionManager: Service[SparkSQLSessionManager] is stopped. 25/07/21 20:20:55 INFO [main-EventThread] SparkSQLBackendService: Service[SparkSQLBackendService] is stopped. 25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service[SparkSQLEngine] is stopped. 25/07/21 20:20:55 WARN [main] SparkSQLEngine: Shutdown Watchdog is already running, ignoring duplicate start request 25/07/21 20:20:55 INFO [main] SparkContext: SparkContext is stopping with exitCode 0. 25/07/21 20:20:55 INFO [main-EventThread] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/tmp/kyuubi/tmp_kyuubi_result/application_1753099605161_0002, dst=null, size=0, parameter=null, time-in-ms=5, version=6.8.2-nextarch 25/07/21 20:20:55 INFO [main] AbstractConnector: Stopped Spark@30331109{HTTP/1.1, (http/1.1)}{0.0.0.0:0} 25/07/21 20:20:55 ERROR [main-EventThread] PersistentNode: Could not delete node after close java.lang.IllegalStateException: instance must be started before calling this method at org.apache.kyuubi.shaded.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.processBackgroundCallbackClosedState(PersistentNode.java:192) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.access$800(PersistentNode.java:58) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode$5.processResult(PersistentNode.java:168) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:584) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.access$900(CreateBuilderImpl.java:44) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:534) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] 25/07/21 20:20:55 INFO [main-EventThread] ClientCnxn: EventThread shut down for session: 0x10000011bf90006 25/07/21 20:20:55 INFO [main] SparkUI: Stopped Spark web UI at http://master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:46861 25/07/21 20:20:55 INFO [YARN application state monitor] YarnClientSchedulerBackend: Interrupting monitor thread 25/07/21 20:20:55 INFO [main] YarnClientSchedulerBackend: Shutting down all executors 25/07/21 20:20:55 INFO [dispatcher-CoarseGrainedScheduler] YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down 25/07/21 20:20:55 INFO [main] YarnClientSchedulerBackend: YARN client scheduler backend Stopped 25/07/21 20:20:55 INFO [main] FsStats: cmd=upload, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002.inprogress, dst=null, size=4259382, parameter=uuid:45f6a812-52a0-411e-985c-611811c8d5b5, time-in-ms=36, version=6.8.2-nextarch 25/07/21 20:20:55 INFO [main] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002, dst=null, size=0, parameter=null, time-in-ms=4, version=6.8.2-nextarch 25/07/21 20:20:55 INFO [main] FsStats: cmd=rename, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002.inprogress, dst=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002, size=0, parameter=null, time-in-ms=7, version=6.8.2-nextarch 25/07/21 20:20:55 INFO [main] FsStats: cmd=setTime, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002, dst=null, size=0, parameter=null, time-in-ms=5, version=6.8.2-nextarch 25/07/21 20:20:55 INFO [dispatcher-event-loop-2] MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 25/07/21 20:20:55 INFO [main] MemoryStore: MemoryStore cleared 25/07/21 20:20:55 INFO [main] BlockManager: BlockManager stopped 25/07/21 20:20:55 INFO [main] BlockManagerMaster: BlockManagerMaster stopped 25/07/21 20:20:55 INFO [dispatcher-event-loop-2] OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 25/07/21 20:20:55 INFO [main] SparkContext: Successfully stopped SparkContext 25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: EMERGENCY SHUTDOWN TRIGGERED 25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Graceful shutdown exceeded 1 minute(s) timeout 25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Non-daemon threads are preventing JVM exit 25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Initiating forced termination... 25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: === THREAD DUMP FOR DIAGNOSTIC === 25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: ================== Thread Dump Start ================== Timestamp: 2025-07-21 20:21:55 Total threads: 20 Thread Statistics: Non-daemon threads: 2 Daemon threads: 18 Threads by state: RUNNABLE : 5 TIMED_WAITING : 9 WAITING : 6 ==================== Non-Daemon Threads ==================== (These threads prevent JVM from exiting) Thread: "DestroyJavaVM" #229 State: RUNNABLE Thread: "forever-non-daemon-thread" #88 State: TIMED_WAITING at java.lang.Thread.sleep(Native Method) at org.apache.kyuubi.engine.spark.SparkSQLEngine$$anon$1.run(SparkSQLEngine.scala:104) at java.lang.Thread.run(Thread.java:750) ====================== Daemon Threads ====================== Thread: "BTrace Command Queue Processor" #7 daemon State: WAITING Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@33d0d53> at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Thread: "Finalizer" #3 daemon State: WAITING Waiting on: <java.lang.ref.ReferenceQueue$Lock@568ebe39> at java.lang.Object.wait(Native Method) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:188) Thread: "OkHttp ConnectionPool" #79 daemon State: TIMED_WAITING Waiting on: <java.util.concurrent.SynchronousQueue$TransferStack@461a0a75> at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362) at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Thread: "Okio Watchdog" #78 daemon State: TIMED_WAITING Waiting on: <java.lang.Class@5586b369> at java.lang.Object.wait(Native Method) at com.aliyun.datalake.external.okio.AsyncTimeout.awaitTimeout(AsyncTimeout.java:347) at com.aliyun.datalake.external.okio.AsyncTimeout$Watchdog.run(AsyncTimeout.java:312) Thread: "Reference Handler" #2 daemon State: WAITING Waiting on: <java.lang.ref.Reference$Lock@6ae02e16> at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at java.lang.ref.Reference.tryHandlePending(Reference.java:191) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153) Thread: "Signal Dispatcher" #4 daemon State: RUNNABLE Thread: "Thread-0" #5 daemon State: TIMED_WAITING at java.lang.Thread.sleep(Native Method) at com.sun.btrace.BTraceRuntime$2.idle(BTraceRuntime.java:402) at com.sun.btrace.org.jctools.queues.BaseMpscLinkedArrayQueue.drain(BaseMpscLinkedArrayQueue.java:670) at com.sun.btrace.BTraceRuntime$4.run(BTraceRuntime.java:442) at java.lang.Thread.run(Thread.java:750) Thread: "Thread-17" #54 daemon State: RUNNABLE Thread: "Thread-19" #57 daemon State: RUNNABLE Thread: "cached-current-time-updater" #56 daemon State: TIMED_WAITING Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2313e826> at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Thread: "dlf-token-refresher-0" #76 daemon State: TIMED_WAITING Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6574acf4> at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Thread: "idle-connection-evictor-1" #66 daemon State: TIMED_WAITING at java.lang.Thread.sleep(Native Method) at java.lang.Thread.sleep(Thread.java:342) at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386) at com.aliyun.platform.org.apache.hc.core5.util.TimeValue.sleep(TimeValue.java:383) at com.aliyun.platform.org.apache.hc.client5.http.impl.IdleConnectionEvictor.lambda$new$0(IdleConnectionEvictor.java:60) at com.aliyun.platform.org.apache.hc.client5.http.impl.IdleConnectionEvictor$$Lambda$1199/196954001.run(Unknown Source) at java.lang.Thread.run(Thread.java:750) Thread: "org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #55 daemon State: WAITING Waiting on: <java.lang.ref.ReferenceQueue$Lock@43ee1ad6> at java.lang.Object.wait(Native Method) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165) at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3762) at java.lang.Thread.run(Thread.java:750) Thread: "pool-29-thread-1" #68 daemon State: WAITING Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@35cc0548> at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Thread: "pool-30-thread-1" #69 daemon State: TIMED_WAITING Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@35cc0548> at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Thread: "process reaper" #224 daemon State: TIMED_WAITING Waiting on: <java.util.concurrent.SynchronousQueue$TransferStack@78f9176a> at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362) at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Thread: "shutdown-watchdog" #214 daemon State: RUNNABLE at sun.management.ThreadImpl.dumpThreads0(Native Method) at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:503) at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:491) at org.apache.kyuubi.util.ThreadDumpUtils$.getAllExtendedThreadInfo(ThreadDumpUtils.scala:91) at org.apache.kyuubi.util.ThreadDumpUtils$.dumpToWriter(ThreadDumpUtils.scala:158) at org.apache.kyuubi.util.ThreadDumpUtils$.dumpToString(ThreadDumpUtils.scala:119) at org.apache.kyuubi.util.ThreadDumpUtils$.dumpToLogger(ThreadDumpUtils.scala:139) at org.apache.kyuubi.engine.spark.SparkSQLEngine$$anon$2.run(SparkSQLEngine.scala:281) Thread: "yarn-scheduler-endpoint" #213 daemon State: WAITING Waiting on: <java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4542b55f> at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) ======================== Summary ======================== Thread ID | Type | State | Name ----------|------|-----------------|--------------------------- 2 | D | WAITING | Reference Handler 3 | D | WAITING | Finalizer 4 | D | RUNNABLE | Signal Dispatcher 5 | D | TIMED_WAITING | Thread-0 7 | D | WAITING | BTrace Command Queue Processor 54 | D | RUNNABLE | Thread-17 55 | D | WAITING | org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner 56 | D | TIMED_WAITING | cached-current-time-updater 57 | D | RUNNABLE | Thread-19 66 | D | TIMED_WAITING | idle-connection-evictor-1 68 | D | WAITING | pool-29-thread-1 69 | D | TIMED_WAITING | pool-30-thread-1 76 | D | TIMED_WAITING | dlf-token-refresher-0 78 | D | TIMED_WAITING | Okio Watchdog 79 | D | TIMED_WAITING | OkHttp ConnectionPool 88 | U | TIMED_WAITING | forever-non-daemon-thread 213 | D | WAITING | yarn-scheduler-endpoint 214 | D | RUNNABLE | shutdown-watchdog 224 | D | TIMED_WAITING | process reaper 229 | U | RUNNABLE | DestroyJavaVM ================== Thread Dump End ================== 25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: === END OF THREAD DUMP === 25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Forcefully terminating JVM now... 25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@728535c6{/kyuubi,null,STOPPED,@Spark} 25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@bd09a26{/kyuubi/json,null,STOPPED,@Spark} 25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@3065efd0{/kyuubi/session,null,STOPPED,@Spark} 25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@61aa6300{/kyuubi/session/json,null,STOPPED,@Spark} 25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Shutdown hook called 25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-a0697749-d2d8-4cce-83ac-ddcff8d66323 25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-8931f08a-f571-45d8-8657-c541d130cb17 ``` 3. Set kyuubi.session.engine.shutdown.watchdog.timeout to 0, meaning disabling this feature, then repeat the steps above, and the engine will hang indefinitely without ending. It can be seen that the engine got stuck for about 7 minutes and was then manually killed. ``` 25/07/21 20:34:55 INFO [main] SparkContext: Successfully stopped SparkContext 25/07/21 20:41:32 ERROR [SIGTERM handler] SparkSQLEngine: RECEIVED SIGNAL 15: TERM ``` ``` 25/07/21 20:34:45 WARN [main-EventThread] ZookeeperDiscoveryClient: This Kyuubi instance master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:34369 is now de-registered from ZooKeeper. The server will be shut down after the last client session completes. 25/07/21 20:34:45 INFO [main-EventThread] EngineServiceDiscovery: 1 connection(s) are active, delay shutdown 25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkTBinaryFrontendService: Received request of closing SessionHandle [6a4e8d11-f5d3-4607-83bf-04644fc1e416] 25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkSQLSessionManager: zhigang's SparkSessionImpl with SessionHandle [6a4e8d11-f5d3-4607-83bf-04644fc1e416] is closed, current opening sessions 0 25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkSessionImpl: sessionId=6a4e8d11-f5d3-4607-83bf-04644fc1e416, sessionRunTime=4.0 min, sessionCpuTime=37 s 25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] SparkTBinaryFrontendService: Finished closing SessionHandle [6a4e8d11-f5d3-4607-83bf-04644fc1e416] 25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Shutdown Watchdog is disabled (timeout <= 0). 25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkTBinaryFrontend] is stopping. 25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service: [EngineServiceDiscovery] is stopping. 25/07/21 20:34:55 INFO [Curator-Framework-0] CuratorFrameworkImpl: backgroundOperationsLoop exiting 25/07/21 20:34:55 INFO [main-EventThread] ZooKeeper: Session: 0x10000011bf9000d closed 25/07/21 20:34:55 INFO [main-EventThread] EngineServiceDiscovery: Service[EngineServiceDiscovery] is stopped. 25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: Service[SparkTBinaryFrontend] is stopped. 25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: SparkTBinaryFrontend has stopped 25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service: [SparkSQLBackendService] is stopping. 25/07/21 20:34:55 INFO [main-EventThread] SparkSQLBackendService: Service: [SparkSQLSessionManager] is stopping. 25/07/21 20:34:55 INFO [main-EventThread] SparkSQLSessionManager: Service: [SparkSQLOperationManager] is stopping. 25/07/21 20:34:55 INFO [main-EventThread] SparkSQLOperationManager: Service[SparkSQLOperationManager] is stopped. 25/07/21 20:34:55 INFO [main-EventThread] SparkSQLSessionManager: Service[SparkSQLSessionManager] is stopped. 25/07/21 20:34:55 INFO [main-EventThread] SparkSQLBackendService: Service[SparkSQLBackendService] is stopped. 25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service[SparkSQLEngine] is stopped. 25/07/21 20:34:55 INFO [main] SparkSQLEngine: Shutdown Watchdog is disabled (timeout <= 0). 25/07/21 20:34:55 INFO [main] SparkContext: SparkContext is stopping with exitCode 0. 25/07/21 20:34:55 INFO [main] AbstractConnector: Stopped Spark@30331109{HTTP/1.1, (http/1.1)}{0.0.0.0:0} 25/07/21 20:34:55 INFO [main] SparkUI: Stopped Spark web UI at http://master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:33101 25/07/21 20:34:55 INFO [YARN application state monitor] YarnClientSchedulerBackend: Interrupting monitor thread 25/07/21 20:34:55 INFO [main] YarnClientSchedulerBackend: Shutting down all executors 25/07/21 20:34:55 INFO [dispatcher-CoarseGrainedScheduler] YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down 25/07/21 20:34:55 INFO [main-EventThread] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/tmp/kyuubi/tmp_kyuubi_result/application_1753099605161_0003, dst=null, size=0, parameter=null, time-in-ms=18, version=6.8.2-nextarch 25/07/21 20:34:55 ERROR [main-EventThread] PersistentNode: Could not delete node after close java.lang.IllegalStateException: instance must be started before calling this method at org.apache.kyuubi.shaded.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359) ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.processBackgroundCallbackClosedState(PersistentNode.java:192) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.access$800(PersistentNode.java:58) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode$5.processResult(PersistentNode.java:168) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:584) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.access$900(CreateBuilderImpl.java:44) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:534) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] at org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?] 25/07/21 20:34:55 INFO [main-EventThread] ClientCnxn: EventThread shut down for session: 0x10000011bf9000d 25/07/21 20:34:55 INFO [main] YarnClientSchedulerBackend: YARN client scheduler backend Stopped 25/07/21 20:34:55 INFO [main] FsStats: cmd=upload, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003.inprogress, dst=null, size=4304759, parameter=uuid:bb87de74-c32e-4886-a8f1-c432052ade6c, time-in-ms=42, version=6.8.2-nextarch 25/07/21 20:34:55 INFO [main] FsStats: cmd=getFileStatus, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003, dst=null, size=0, parameter=null, time-in-ms=23, version=6.8.2-nextarch 25/07/21 20:34:55 INFO [main] FsStats: cmd=rename, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003.inprogress, dst=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003, size=0, parameter=null, time-in-ms=10, version=6.8.2-nextarch 25/07/21 20:34:55 INFO [main] FsStats: cmd=setTime, src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003, dst=null, size=0, parameter=null, time-in-ms=7, version=6.8.2-nextarch 25/07/21 20:34:55 INFO [dispatcher-event-loop-1] MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 25/07/21 20:34:55 INFO [main] MemoryStore: MemoryStore cleared 25/07/21 20:34:55 INFO [main] BlockManager: BlockManager stopped 25/07/21 20:34:55 INFO [main] BlockManagerMaster: BlockManagerMaster stopped 25/07/21 20:34:55 INFO [dispatcher-event-loop-1] OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 25/07/21 20:34:55 INFO [main] SparkContext: Successfully stopped SparkContext 25/07/21 20:41:32 ERROR [SIGTERM handler] SparkSQLEngine: RECEIVED SIGNAL 15: TERM 25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@766db6f9{/kyuubi,null,STOPPED,@Spark} 25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@244f356{/kyuubi/json,null,STOPPED,@Spark} 25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@495f3965{/kyuubi/session,null,STOPPED,@Spark} 25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped o.s.j.s.ServletContextHandler@7aaf6bfd{/kyuubi/session/json,null,STOPPED,@Spark} 25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Shutdown hook called 25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-3b2e31df-0f57-4b8b-bdd5-b9218aec09ff 25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Deleting directory /tmp/spark-4adc1b87-4c50-4b77-9dfe-c40d055f5c8c ``` ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
