wangzhigang1999 opened a new pull request, #7150:
URL: https://github.com/apache/kyuubi/pull/7150

   - Introduced a new configuration property 
`kyuubi.session.engine.shutdown.watchdog.timeout` to manage the maximum wait 
time for engine shutdown.
   - Updated the `SparkSQLEngine` to utilize the new watchdog feature.
   - Minor adjustments to existing configurations and documentation to reflect 
the new feature.
   
   
   ### Why are the changes needed?
   Currently, there are scenarios where the engine should exit but fails to do 
so due to various reasons, and these scenarios cannot be exhaustively 
enumerated. For example, see this discussion: 
https://github.com/apache/kyuubi/discussions/6992#discussioncomment-13775648, 
and these issues: https://github.com/apache/kyuubi/issues/4280, 
https://github.com/apache/kyuubi/issues/7019.
   
   Similarly, we encountered this issue in production. For example, in the 
following log, after SparkContext stopped, the entire process should have 
executed the shutdown hook and exited. However, due to an abnormal Ranger 
thread, the process was blocked for over ten days until it eventually exhausted 
the ECS resources and was finally discovered.
   
   <img width="2844" height="1112" alt="image" 
src="https://github.com/user-attachments/assets/f09be191-d6f6-43dc-a462-c1d81a4f6758";
 />
   
   
   
   ### How was this patch tested?
   
   For the ThreadDumpUtils utility class, I added unit tests; for the overall 
process, I conducted an end-to-end test:
   
   1.  Add a thread that never exits to the source code.
   <img width="2766" height="1996" alt="image" 
src="https://github.com/user-attachments/assets/34e2e02a-88ce-4afa-a72c-4a9886c1072f";
 />
   
   2. Keeping our configuration unchanged, first start an engine, then use 
kyuubi-ctl in the command line to delete it. Below is the complete log, showing 
that after one minute, the engine was successfully terminated.
   ```
   25/07/21 20:20:55 WARN [main-EventThread] ZookeeperDiscoveryClient: This 
Kyuubi instance 
master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:32377 is now 
de-registered from ZooKeeper. The server will be shut down after the last 
client session completes.
   25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Shutdown Watchdog 
activated. Engine will be forcefully terminated if graceful shutdown exceeds 1 
minute(s).
   25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service: 
[SparkTBinaryFrontend] is stopping.
   25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: 
Service: [EngineServiceDiscovery] is stopping.
   25/07/21 20:20:55 INFO [Curator-Framework-0] CuratorFrameworkImpl: 
backgroundOperationsLoop exiting
   25/07/21 20:20:55 INFO [main-EventThread] ZooKeeper: Session: 
0x10000011bf90006 closed
   25/07/21 20:20:55 INFO [main-EventThread] EngineServiceDiscovery: 
Service[EngineServiceDiscovery] is stopped.
   25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: 
Service[SparkTBinaryFrontend] is stopped.
   25/07/21 20:20:55 INFO [main-EventThread] SparkTBinaryFrontendService: 
SparkTBinaryFrontend has stopped
   25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: Service: 
[SparkSQLBackendService] is stopping.
   25/07/21 20:20:55 INFO [main-EventThread] SparkSQLBackendService: Service: 
[SparkSQLSessionManager] is stopping.
   25/07/21 20:20:55 INFO [main-EventThread] SparkSQLSessionManager: Service: 
[SparkSQLOperationManager] is stopping.
   25/07/21 20:20:55 INFO [main-EventThread] SparkSQLOperationManager: 
Service[SparkSQLOperationManager] is stopped.
   25/07/21 20:20:55 INFO [main-EventThread] SparkSQLSessionManager: 
Service[SparkSQLSessionManager] is stopped.
   25/07/21 20:20:55 INFO [main-EventThread] SparkSQLBackendService: 
Service[SparkSQLBackendService] is stopped.
   25/07/21 20:20:55 INFO [main-EventThread] SparkSQLEngine: 
Service[SparkSQLEngine] is stopped.
   25/07/21 20:20:55 WARN [main] SparkSQLEngine: Shutdown Watchdog is already 
running, ignoring duplicate start request
   25/07/21 20:20:55 INFO [main] SparkContext: SparkContext is stopping with 
exitCode 0.
   25/07/21 20:20:55 INFO [main-EventThread] FsStats: cmd=getFileStatus, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/tmp/kyuubi/tmp_kyuubi_result/application_1753099605161_0002,
 dst=null, size=0, parameter=null, time-in-ms=5, version=6.8.2-nextarch
   25/07/21 20:20:55 INFO [main] AbstractConnector: Stopped 
Spark@30331109{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
   25/07/21 20:20:55 ERROR [main-EventThread] PersistentNode: Could not delete 
node after close
   java.lang.IllegalStateException: instance must be started before calling 
this method
        at 
org.apache.kyuubi.shaded.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176)
 ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359)
 ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.processBackgroundCallbackClosedState(PersistentNode.java:192)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.access$800(PersistentNode.java:58)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode$5.processResult(PersistentNode.java:168)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:584)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.access$900(CreateBuilderImpl.java:44)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:534)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
   25/07/21 20:20:55 INFO [main-EventThread] ClientCnxn: EventThread shut down 
for session: 0x10000011bf90006
   25/07/21 20:20:55 INFO [main] SparkUI: Stopped Spark web UI at 
http://master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:46861
   25/07/21 20:20:55 INFO [YARN application state monitor] 
YarnClientSchedulerBackend: Interrupting monitor thread
   25/07/21 20:20:55 INFO [main] YarnClientSchedulerBackend: Shutting down all 
executors
   25/07/21 20:20:55 INFO [dispatcher-CoarseGrainedScheduler] 
YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
   25/07/21 20:20:55 INFO [main] YarnClientSchedulerBackend: YARN client 
scheduler backend Stopped
   25/07/21 20:20:55 INFO [main] FsStats: cmd=upload, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002.inprogress,
 dst=null, size=4259382, parameter=uuid:45f6a812-52a0-411e-985c-611811c8d5b5, 
time-in-ms=36, version=6.8.2-nextarch
   25/07/21 20:20:55 INFO [main] FsStats: cmd=getFileStatus, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002,
 dst=null, size=0, parameter=null, time-in-ms=4, version=6.8.2-nextarch
   25/07/21 20:20:55 INFO [main] FsStats: cmd=rename, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002.inprogress,
 
dst=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002,
 size=0, parameter=null, time-in-ms=7, version=6.8.2-nextarch
   25/07/21 20:20:55 INFO [main] FsStats: cmd=setTime, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0002,
 dst=null, size=0, parameter=null, time-in-ms=5, version=6.8.2-nextarch
   25/07/21 20:20:55 INFO [dispatcher-event-loop-2] 
MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
   25/07/21 20:20:55 INFO [main] MemoryStore: MemoryStore cleared
   25/07/21 20:20:55 INFO [main] BlockManager: BlockManager stopped
   25/07/21 20:20:55 INFO [main] BlockManagerMaster: BlockManagerMaster stopped
   25/07/21 20:20:55 INFO [dispatcher-event-loop-2] 
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
   25/07/21 20:20:55 INFO [main] SparkContext: Successfully stopped SparkContext
   25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: EMERGENCY 
SHUTDOWN TRIGGERED
   25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Graceful 
shutdown exceeded 1 minute(s) timeout
   25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Non-daemon 
threads are preventing JVM exit
   25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Initiating 
forced termination...
   25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: === THREAD DUMP 
FOR DIAGNOSTIC ===
   25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: 
   ================== Thread Dump Start ==================
   Timestamp: 2025-07-21 20:21:55
   Total threads: 20
   
   Thread Statistics:
     Non-daemon threads:   2
     Daemon threads:      18
   
   Threads by state:
     RUNNABLE       :   5
     TIMED_WAITING  :   9
     WAITING        :   6
   
   ==================== Non-Daemon Threads ====================
   (These threads prevent JVM from exiting)
   
   Thread: "DestroyJavaVM" #229 
      State: RUNNABLE
   
   Thread: "forever-non-daemon-thread" #88 
      State: TIMED_WAITING
           at java.lang.Thread.sleep(Native Method)
           at 
org.apache.kyuubi.engine.spark.SparkSQLEngine$$anon$1.run(SparkSQLEngine.scala:104)
           at java.lang.Thread.run(Thread.java:750)
   
   ====================== Daemon Threads ======================
   
   Thread: "BTrace Command Queue Processor" #7 daemon
      State: WAITING
      Waiting on: 
<java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@33d0d53>
           at sun.misc.Unsafe.park(Native Method)
           at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
           at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
           at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
           at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: "Finalizer" #3 daemon
      State: WAITING
      Waiting on: <java.lang.ref.ReferenceQueue$Lock@568ebe39>
           at java.lang.Object.wait(Native Method)
           at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
           at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
           at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:188)
   
   Thread: "OkHttp ConnectionPool" #79 daemon
      State: TIMED_WAITING
      Waiting on: <java.util.concurrent.SynchronousQueue$TransferStack@461a0a75>
           at sun.misc.Unsafe.park(Native Method)
           at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at 
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
           at 
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
           at 
java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
           at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: "Okio Watchdog" #78 daemon
      State: TIMED_WAITING
      Waiting on: <java.lang.Class@5586b369>
           at java.lang.Object.wait(Native Method)
           at 
com.aliyun.datalake.external.okio.AsyncTimeout.awaitTimeout(AsyncTimeout.java:347)
           at 
com.aliyun.datalake.external.okio.AsyncTimeout$Watchdog.run(AsyncTimeout.java:312)
   
   Thread: "Reference Handler" #2 daemon
      State: WAITING
      Waiting on: <java.lang.ref.Reference$Lock@6ae02e16>
           at java.lang.Object.wait(Native Method)
           at java.lang.Object.wait(Object.java:502)
           at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
           at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
   
   Thread: "Signal Dispatcher" #4 daemon
      State: RUNNABLE
   
   Thread: "Thread-0" #5 daemon
      State: TIMED_WAITING
           at java.lang.Thread.sleep(Native Method)
           at com.sun.btrace.BTraceRuntime$2.idle(BTraceRuntime.java:402)
           at 
com.sun.btrace.org.jctools.queues.BaseMpscLinkedArrayQueue.drain(BaseMpscLinkedArrayQueue.java:670)
           at com.sun.btrace.BTraceRuntime$4.run(BTraceRuntime.java:442)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: "Thread-17" #54 daemon
      State: RUNNABLE
   
   Thread: "Thread-19" #57 daemon
      State: RUNNABLE
   
   Thread: "cached-current-time-updater" #56 daemon
      State: TIMED_WAITING
      Waiting on: 
<java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2313e826>
           at sun.misc.Unsafe.park(Native Method)
           at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
           at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: "dlf-token-refresher-0" #76 daemon
      State: TIMED_WAITING
      Waiting on: 
<java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6574acf4>
           at sun.misc.Unsafe.park(Native Method)
           at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
           at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: "idle-connection-evictor-1" #66 daemon
      State: TIMED_WAITING
           at java.lang.Thread.sleep(Native Method)
           at java.lang.Thread.sleep(Thread.java:342)
           at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
           at 
com.aliyun.platform.org.apache.hc.core5.util.TimeValue.sleep(TimeValue.java:383)
           at 
com.aliyun.platform.org.apache.hc.client5.http.impl.IdleConnectionEvictor.lambda$new$0(IdleConnectionEvictor.java:60)
           at 
com.aliyun.platform.org.apache.hc.client5.http.impl.IdleConnectionEvictor$$Lambda$1199/196954001.run(Unknown
 Source)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: 
"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #55 
daemon
      State: WAITING
      Waiting on: <java.lang.ref.ReferenceQueue$Lock@43ee1ad6>
           at java.lang.Object.wait(Native Method)
           at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
           at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
           at 
org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3762)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: "pool-29-thread-1" #68 daemon
      State: WAITING
      Waiting on: 
<java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@35cc0548>
           at sun.misc.Unsafe.park(Native Method)
           at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
           at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
           at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: "pool-30-thread-1" #69 daemon
      State: TIMED_WAITING
      Waiting on: 
<java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@35cc0548>
           at sun.misc.Unsafe.park(Native Method)
           at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
           at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
           at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: "process reaper" #224 daemon
      State: TIMED_WAITING
      Waiting on: <java.util.concurrent.SynchronousQueue$TransferStack@78f9176a>
           at sun.misc.Unsafe.park(Native Method)
           at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
           at 
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
           at 
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
           at 
java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
           at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   
   Thread: "shutdown-watchdog" #214 daemon
      State: RUNNABLE
           at sun.management.ThreadImpl.dumpThreads0(Native Method)
           at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:503)
           at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:491)
           at 
org.apache.kyuubi.util.ThreadDumpUtils$.getAllExtendedThreadInfo(ThreadDumpUtils.scala:91)
           at 
org.apache.kyuubi.util.ThreadDumpUtils$.dumpToWriter(ThreadDumpUtils.scala:158)
           at 
org.apache.kyuubi.util.ThreadDumpUtils$.dumpToString(ThreadDumpUtils.scala:119)
           at 
org.apache.kyuubi.util.ThreadDumpUtils$.dumpToLogger(ThreadDumpUtils.scala:139)
           at 
org.apache.kyuubi.engine.spark.SparkSQLEngine$$anon$2.run(SparkSQLEngine.scala:281)
   
   Thread: "yarn-scheduler-endpoint" #213 daemon
      State: WAITING
      Waiting on: 
<java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@4542b55f>
           at sun.misc.Unsafe.park(Native Method)
           at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
           at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
           at 
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
           at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:750)
   
   ======================== Summary ========================
   Thread ID | Type | State           | Name
   ----------|------|-----------------|---------------------------
          2 |    D | WAITING         | Reference Handler
          3 |    D | WAITING         | Finalizer
          4 |    D | RUNNABLE        | Signal Dispatcher
          5 |    D | TIMED_WAITING   | Thread-0
          7 |    D | WAITING         | BTrace Command Queue Processor
         54 |    D | RUNNABLE        | Thread-17
         55 |    D | WAITING         | 
org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner
         56 |    D | TIMED_WAITING   | cached-current-time-updater
         57 |    D | RUNNABLE        | Thread-19
         66 |    D | TIMED_WAITING   | idle-connection-evictor-1
         68 |    D | WAITING         | pool-29-thread-1
         69 |    D | TIMED_WAITING   | pool-30-thread-1
         76 |    D | TIMED_WAITING   | dlf-token-refresher-0
         78 |    D | TIMED_WAITING   | Okio Watchdog
         79 |    D | TIMED_WAITING   | OkHttp ConnectionPool
         88 |    U | TIMED_WAITING   | forever-non-daemon-thread
        213 |    D | WAITING         | yarn-scheduler-endpoint
        214 |    D | RUNNABLE        | shutdown-watchdog
        224 |    D | TIMED_WAITING   | process reaper
        229 |    U | RUNNABLE        | DestroyJavaVM
   ================== Thread Dump End ==================
   
   25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: === END OF 
THREAD DUMP ===
   25/07/21 20:21:55 ERROR [shutdown-watchdog] SparkSQLEngine: Forcefully 
terminating JVM now...
   25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@728535c6{/kyuubi,null,STOPPED,@Spark}
   25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@bd09a26{/kyuubi/json,null,STOPPED,@Spark}
   25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@3065efd0{/kyuubi/session,null,STOPPED,@Spark}
   25/07/21 20:21:55 INFO [shutdown-hook-0] ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@61aa6300{/kyuubi/session/json,null,STOPPED,@Spark}
   25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Shutdown hook 
called
   25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Deleting 
directory /tmp/spark-a0697749-d2d8-4cce-83ac-ddcff8d66323
   25/07/21 20:21:55 INFO [shutdown-hook-0] ShutdownHookManager: Deleting 
directory /tmp/spark-8931f08a-f571-45d8-8657-c541d130cb17
   ```
   3. Set kyuubi.session.engine.shutdown.watchdog.timeout to 0, meaning 
disabling this feature, then repeat the steps above, and the engine will hang 
indefinitely without ending. It can be seen that the engine got stuck for about 
7 minutes and was then manually killed.
   
   ```
   25/07/21 20:34:55 INFO [main] SparkContext: Successfully stopped SparkContext
   25/07/21 20:41:32 ERROR [SIGTERM handler] SparkSQLEngine: RECEIVED SIGNAL 
15: TERM
   ```
   
   ```
   25/07/21 20:34:45 WARN [main-EventThread] ZookeeperDiscoveryClient: This 
Kyuubi instance 
master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:34369 is now 
de-registered from ZooKeeper. The server will be shut down after the last 
client session completes.
   25/07/21 20:34:45 INFO [main-EventThread] EngineServiceDiscovery: 1 
connection(s) are active, delay shutdown
   25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] 
SparkTBinaryFrontendService: Received request of closing SessionHandle 
[6a4e8d11-f5d3-4607-83bf-04644fc1e416]
   25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] 
SparkSQLSessionManager: zhigang's SparkSessionImpl with SessionHandle 
[6a4e8d11-f5d3-4607-83bf-04644fc1e416] is closed, current opening sessions 0
   25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] 
SparkSessionImpl: sessionId=6a4e8d11-f5d3-4607-83bf-04644fc1e416, 
sessionRunTime=4.0 min, sessionCpuTime=37 s
   25/07/21 20:34:49 INFO [SparkTBinaryFrontendHandler-Pool: Thread-98] 
SparkTBinaryFrontendService: Finished closing SessionHandle 
[6a4e8d11-f5d3-4607-83bf-04644fc1e416]
   25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Shutdown Watchdog 
is disabled (timeout <= 0).
   25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service: 
[SparkTBinaryFrontend] is stopping.
   25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: 
Service: [EngineServiceDiscovery] is stopping.
   25/07/21 20:34:55 INFO [Curator-Framework-0] CuratorFrameworkImpl: 
backgroundOperationsLoop exiting
   25/07/21 20:34:55 INFO [main-EventThread] ZooKeeper: Session: 
0x10000011bf9000d closed
   25/07/21 20:34:55 INFO [main-EventThread] EngineServiceDiscovery: 
Service[EngineServiceDiscovery] is stopped.
   25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: 
Service[SparkTBinaryFrontend] is stopped.
   25/07/21 20:34:55 INFO [main-EventThread] SparkTBinaryFrontendService: 
SparkTBinaryFrontend has stopped
   25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: Service: 
[SparkSQLBackendService] is stopping.
   25/07/21 20:34:55 INFO [main-EventThread] SparkSQLBackendService: Service: 
[SparkSQLSessionManager] is stopping.
   25/07/21 20:34:55 INFO [main-EventThread] SparkSQLSessionManager: Service: 
[SparkSQLOperationManager] is stopping.
   25/07/21 20:34:55 INFO [main-EventThread] SparkSQLOperationManager: 
Service[SparkSQLOperationManager] is stopped.
   25/07/21 20:34:55 INFO [main-EventThread] SparkSQLSessionManager: 
Service[SparkSQLSessionManager] is stopped.
   25/07/21 20:34:55 INFO [main-EventThread] SparkSQLBackendService: 
Service[SparkSQLBackendService] is stopped.
   25/07/21 20:34:55 INFO [main-EventThread] SparkSQLEngine: 
Service[SparkSQLEngine] is stopped.
   25/07/21 20:34:55 INFO [main] SparkSQLEngine: Shutdown Watchdog is disabled 
(timeout <= 0).
   25/07/21 20:34:55 INFO [main] SparkContext: SparkContext is stopping with 
exitCode 0.
   25/07/21 20:34:55 INFO [main] AbstractConnector: Stopped 
Spark@30331109{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
   25/07/21 20:34:55 INFO [main] SparkUI: Stopped Spark web UI at 
http://master-1-1.c-257944aae9ab7038.cn-hangzhou.emr.aliyuncs.com:33101
   25/07/21 20:34:55 INFO [YARN application state monitor] 
YarnClientSchedulerBackend: Interrupting monitor thread
   25/07/21 20:34:55 INFO [main] YarnClientSchedulerBackend: Shutting down all 
executors
   25/07/21 20:34:55 INFO [dispatcher-CoarseGrainedScheduler] 
YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
   25/07/21 20:34:55 INFO [main-EventThread] FsStats: cmd=getFileStatus, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/tmp/kyuubi/tmp_kyuubi_result/application_1753099605161_0003,
 dst=null, size=0, parameter=null, time-in-ms=18, version=6.8.2-nextarch
   25/07/21 20:34:55 ERROR [main-EventThread] PersistentNode: Could not delete 
node after close
   java.lang.IllegalStateException: instance must be started before calling 
this method
        at 
org.apache.kyuubi.shaded.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176)
 ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359)
 ~[kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.processBackgroundCallbackClosedState(PersistentNode.java:192)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode.access$800(PersistentNode.java:58)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.recipes.nodes.PersistentNode$5.processResult(PersistentNode.java:168)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:584)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl.access$900(CreateBuilderImpl.java:44)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:534)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
        at 
org.apache.kyuubi.shaded.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
 [kyuubi-spark-sql-engine_2.12-1.11.0-SNAPSHOT.jar:?]
   25/07/21 20:34:55 INFO [main-EventThread] ClientCnxn: EventThread shut down 
for session: 0x10000011bf9000d
   25/07/21 20:34:55 INFO [main] YarnClientSchedulerBackend: YARN client 
scheduler backend Stopped
   25/07/21 20:34:55 INFO [main] FsStats: cmd=upload, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003.inprogress,
 dst=null, size=4304759, parameter=uuid:bb87de74-c32e-4886-a8f1-c432052ade6c, 
time-in-ms=42, version=6.8.2-nextarch
   25/07/21 20:34:55 INFO [main] FsStats: cmd=getFileStatus, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003,
 dst=null, size=0, parameter=null, time-in-ms=23, version=6.8.2-nextarch
   25/07/21 20:34:55 INFO [main] FsStats: cmd=rename, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003.inprogress,
 
dst=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003,
 size=0, parameter=null, time-in-ms=10, version=6.8.2-nextarch
   25/07/21 20:34:55 INFO [main] FsStats: cmd=setTime, 
src=oss://doctor-olap-test.cn-hangzhou.oss-dls.aliyuncs.com/spark-history/c-257944aae9ab7038/application_1753099605161_0003,
 dst=null, size=0, parameter=null, time-in-ms=7, version=6.8.2-nextarch
   25/07/21 20:34:55 INFO [dispatcher-event-loop-1] 
MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
   25/07/21 20:34:55 INFO [main] MemoryStore: MemoryStore cleared
   25/07/21 20:34:55 INFO [main] BlockManager: BlockManager stopped
   25/07/21 20:34:55 INFO [main] BlockManagerMaster: BlockManagerMaster stopped
   25/07/21 20:34:55 INFO [dispatcher-event-loop-1] 
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
   25/07/21 20:34:55 INFO [main] SparkContext: Successfully stopped SparkContext
   25/07/21 20:41:32 ERROR [SIGTERM handler] SparkSQLEngine: RECEIVED SIGNAL 
15: TERM
   25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@766db6f9{/kyuubi,null,STOPPED,@Spark}
   25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@244f356{/kyuubi/json,null,STOPPED,@Spark}
   25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@495f3965{/kyuubi/session,null,STOPPED,@Spark}
   25/07/21 20:41:32 INFO [shutdown-hook-0] ContextHandler: Stopped 
o.s.j.s.ServletContextHandler@7aaf6bfd{/kyuubi/session/json,null,STOPPED,@Spark}
   25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Shutdown hook 
called
   25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Deleting 
directory /tmp/spark-3b2e31df-0f57-4b8b-bdd5-b9218aec09ff
   25/07/21 20:41:32 INFO [shutdown-hook-0] ShutdownHookManager: Deleting 
directory /tmp/spark-4adc1b87-4c50-4b77-9dfe-c40d055f5c8c
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to