Syed Shameerur Rahman created YARN-11826:
--------------------------------------------
Summary: NodeManager Process Stuck In LevelDB Close Operation While
Shutting Down
Key: YARN-11826
URL: https://issues.apache.org/jira/browse/YARN-11826
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 3.3.6
Reporter: Syed Shameerur Rahman
*Hadoop Version 3.3.6*
During NodeManager shutdown, the shutdown operation was observed to hang.
Multiple thread dumps showed that the NodeManager process was stuck in the
LevelDB close operation.
{code:java}
"Thread-566" #808 prio=5 os_prio=0 cpu=20.25ms elapsed=907.93s tid=0x0000aaab1fb3da60 nid=0x104f runnable [0x0000ffff3dce6000]
   java.lang.Thread.State: RUNNABLE
    at org.fusesource.leveldbjni.internal.NativeDB$DBJNI.delete(Native Method)
    at org.fusesource.leveldbjni.internal.NativeDB.delete(NativeDB.java:175)
    at org.fusesource.leveldbjni.internal.JniDB.close(JniDB.java:55)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.closeStorage(NMLeveldbStateStoreService.java:201)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceStop(NMStateStoreService.java:378)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
    - locked <0x0000000085b13450> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stopRecoveryStore(NodeManager.java:329)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:530)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220)
    - locked <0x00000000857da310> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager$1.run(NodeManager.java:543)
{code}
On further analysis, it was noted that the LevelDB close waits for any pending
compaction to finish before it can close, and the thread dump showed that a
compaction was indeed still pending:
{code:java}
   java.lang.Thread.State: RUNNABLE
    at org.fusesource.leveldbjni.internal.NativeDB$DBJNI.CompactRange(Native Method)
    at org.fusesource.leveldbjni.internal.NativeDB.compactRange(NativeDB.java:423)
    at org.fusesource.leveldbjni.internal.NativeDB.compactRange(NativeDB.java:418)
    at org.fusesource.leveldbjni.internal.NativeDB.compactRange(NativeDB.java:404)
    at org.fusesource.leveldbjni.internal.JniDB.compactRange(JniDB.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService$CompactionTimerTask.run(NMLeveldbStateStoreService.java:1736)
    at java.util.TimerThread.mainLoop([email protected]/Timer.java:566)
    at java.util.TimerThread.run([email protected]/Timer.java:516)
{code}
# I checked the instance: it has enough disk space, and other processes are
able to write to the disk.
Is this some kind of issue with LevelDB? Should NodeManager wait for the
LevelDB close with a timeout instead of waiting indefinitely?
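To make the timed-wait question concrete, here is a minimal sketch of what a bounded close could look like. It is not Hadoop code: the class `TimedClose`, the method `closeWithTimeout`, and the thread name are hypothetical, and it uses only `java.util.concurrent`. The close call runs on a daemon thread and the caller waits at most the given timeout. Note that in the thread dump above the close is blocked in native code, so a timeout like this would only let shutdown proceed. It would not stop the native call itself.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration only; not part of Hadoop or leveldbjni. */
final class TimedClose {
  private TimedClose() {}

  /**
   * Runs resource.close() on a daemon thread and waits at most the given time.
   * Returns true if close() finished in time, false if the wait timed out.
   */
  static boolean closeWithTimeout(Closeable resource, long timeout, TimeUnit unit)
      throws IOException, InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "timed-close");
      t.setDaemon(true); // a stuck close must not keep the JVM alive
      return t;
    });
    Future<?> future = executor.submit(() -> {
      resource.close();
      return null;
    });
    try {
      future.get(timeout, unit);
      return true;
    } catch (TimeoutException e) {
      // Interrupting does not unblock a thread that is inside native code;
      // the caller simply stops waiting for it.
      future.cancel(true);
      return false;
    } catch (ExecutionException e) {
      Throwable cause = e.getCause();
      if (cause instanceof IOException) {
        throw (IOException) cause;
      }
      throw new IOException("close failed", cause);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulated store whose close() blocks far longer than we are willing to wait.
    Closeable slowStore = () -> {
      try {
        Thread.sleep(10_000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    boolean closed = closeWithTimeout(slowStore, 200, TimeUnit.MILLISECONDS);
    System.out.println("closed cleanly: " + closed);
  }
}
```

The trade-off is that a timed-out close leaves the store in whatever state the pending compaction reaches, so whether this is safe for NodeManager recovery would still need to be confirmed.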