Ashish Kumar created HDDS-10643:
-----------------------------------

             Summary: SCM fails to stop gracefully
                 Key: HDDS-10643
                 URL: https://issues.apache.org/jira/browse/HDDS-10643
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Ashish Kumar
            Assignee: Ashish Kumar


When SCM stop is called, SCM first stops the Raft server, which internally 
invokes the state machine's 
[close|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java#L444].
 

Since this call comes from Ratis, the state machine tries to 
[terminate|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1796]
 SCM by invoking 
[System.exit|https://github.com/apache/ratis/blob/f40424422b692349b5496ee30e24335c8186093b/ratis-common/src/main/java/org/apache/ratis/util/ExitUtils.java#L138].
But System.exit waits for the shutdown hook to complete, and the shutdown hook 
is waiting for the main thread (which is waiting for the Raft server to close) 
to complete. Both threads wait for each other, which leads to a deadlock.

The deadlock lasts until the shutdown-hook timeout expires, after which the 
waiting thread is interrupted abruptly.
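
The wait cycle described above can be illustrated with a small toy model. This is only a sketch under assumptions, not SCM or Ratis code: the class ShutdownDeadlockSketch, the method hookTimedOut, and the thread name are hypothetical, a ReentrantLock stands in for the java.lang.Shutdown class lock seen in the stack traces, and the SIGTERM handler, the shutdown hook and the Raft server stop are collapsed into the calling thread. No real System.exit is invoked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the reported wait cycle; no real JVM shutdown is triggered. */
final class ShutdownDeadlockSketch {

  /**
   * Returns true if the simulated shutdown hook gave up after hookTimeoutMillis
   * because the simulated updater thread was still stuck in its "exit" call.
   */
  static boolean hookTimedOut(long hookTimeoutMillis) {
    // Assumption: stands in for the java.lang.Shutdown class lock.
    ReentrantLock shutdownLock = new ReentrantLock();
    // Counted down once the updater thread has finished its close path.
    CountDownLatch updaterDone = new CountDownLatch(1);

    Thread updater = new Thread(() -> {
      shutdownLock.lock();      // simulated System.exit: blocks on the lock held by the hook runner
      shutdownLock.unlock();
      updaterDone.countDown();
    }, "StateMachineUpdater-sketch");

    boolean closedInTime;
    shutdownLock.lock();        // simulated SIGTERM handler: takes the lock, then runs the hook
    try {
      updater.start();
      // Simulated hook: wait, with a timeout, for the updater to finish.
      closedInTime = updaterDone.await(hookTimeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for the updater", e);
    } finally {
      shutdownLock.unlock();    // hooks are over; the blocked "exit" call can proceed
    }

    try {
      updater.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while joining the updater", e);
    }
    return !closedInTime;
  }

  public static void main(String[] args) {
    System.out.println("hook timed out: " + hookTimedOut(200));
  }
}
```

In this model the hook can never see the updater finish, because the updater needs the lock that is held for as long as the hook runs. Only the timeout breaks the cycle, which matches the non-graceful exit reported here.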

Currently, SCM shuts down only after this interruption, which is not a graceful exit.

Below are the stack traces:

StateMachineUpdater:
"fa8d8513-6849-46f1-8088-e091f305efda@group-4D11F4CFC172-StateMachineUpdater" #58 daemon prio=5 os_prio=0 cpu=520.27ms elapsed=820.33s tid=0x00007fdc96449000 nid=0x188160 waiting for monitor entry  [0x00007fdc60bd6000]
   java.lang.Thread.State: *BLOCKED* (on object monitor)
    at java.lang.Shutdown.exit(java.base@11.0.3/Shutdown.java:173)
    - waiting to lock <{*}0x000000072d15c608{*}> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Runtime.exit(java.base@11.0.3/Runtime.java:115)
    at java.lang.System.exit(java.base@11.0.3/System.java:1746)
    at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:138)
    at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:151)
    at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:155)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.shutDown(StorageContainerManager.java:1758)
    at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.close(SCMStateMachine.java:439)
    at org.apache.ratis.server.impl.StateMachineUpdater.stop(StateMachineUpdater.java:134)
    at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:189)
    at java.lang.Thread.run(java.base@11.0.3/Thread.java:834)
 
Shutdown-Hook:
"SIGTERM handler" #468 daemon prio=9 os_prio=0 cpu=1.24ms elapsed=20.21s tid=0x0000000000e99000 nid=0x18af9c in Object.wait()  [0x00007fdc4fd00000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(java.base@11.0.3/Native Method)
    - waiting on <no object reference available>
    at java.lang.Thread.join(java.base@11.0.3/Thread.java:1305)
    - waiting to re-lock in wait() <0x000000071165a3f8> (a org.apache.hadoop.ozone.util.ShutdownHookManager$1)
    at java.lang.Thread.join(java.base@11.0.3/Thread.java:1379)
    at java.lang.ApplicationShutdownHooks.runHooks(java.base@11.0.3/ApplicationShutdownHooks.java:107)
    at java.lang.ApplicationShutdownHooks$1.run(java.base@11.0.3/ApplicationShutdownHooks.java:46)
    at java.lang.Shutdown.runHooks(java.base@11.0.3/Shutdown.java:130)
    at java.lang.Shutdown.exit(java.base@11.0.3/Shutdown.java:174)
    - locked <{*}0x000000072d15c608{*}> (a java.lang.Class for java.lang.Shutdown)
    at java.lang.Terminator$1.handle(java.base@11.0.3/Terminator.java:51)
    at sun.misc.Signal$SunMiscHandler.handle(jdk.unsupported@11.0.3/Signal.java:228)
    at org.apache.hadoop.hdds.utils.SignalLogger$Handler.handle(SignalLogger.java:62)
    at sun.misc.Signal$InternalMiscHandler.handle(jdk.unsupported@11.0.3/Signal.java:198)
    at jdk.internal.misc.Signal$1.run(java.base@11.0.3/Signal.java:220)
    at java.lang.Thread.run(java.base@11.0.3/Thread.java:834)
 
Raftserver stop:
"shutdown-hook-0" #469 daemon prio=5 os_prio=0 cpu=59.00ms elapsed=20.19s tid=0x00007fdc795a6800 nid=0x18afa0 waiting on condition  [0x00007fdc46579000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@11.0.3/Native Method)
    - parking to wait for  <0x000000072c928090> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.3/LockSupport.java:234)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.3/AbstractQueuedSynchronizer.java:2123)
    at java.util.concurrent.ThreadPoolExecutor.awaitTermination(java.base@11.0.3/ThreadPoolExecutor.java:1454)
    at org.apache.ratis.util.ConcurrentUtils.{*}shutdownAndWait{*}(ConcurrentUtils.java:143)
    at org.apache.ratis.util.ConcurrentUtils.shutdownAndWait(ConcurrentUtils.java:135)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$close$6(RaftServerProxy.java:432)
    at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$1122/0x0000000800a7e440.run(Unknown Source)
    at org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$4(LifeCycle.java:299)
    at org.apache.ratis.util.LifeCycle$$Lambda$892/0x0000000800960c40.get(Unknown Source)
    at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:319)
    at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:297)
    at org.apache.ratis.server.impl.RaftServerProxy.close(RaftServerProxy.java:415)
    at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.stop(SCMRatisServerImpl.java:255)
    at org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.stop(SCMHAManagerImpl.java:392)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManager.stop(StorageContainerManager.java:1705)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.lambda$start$0(StorageContainerManagerStarter.java:175)
    at org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper$$Lambda$726/0x0000000800817840.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515)
    at java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(java.base@11.0.3/Thread.java:834)


