[ 
https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572244#comment-14572244
 ] 

Rohith commented on YARN-3754:
------------------------------

bq. When the NM is shutting down, ContainerLaunch is also interrupted. While 
handling this interrupted exception, the NM tries to update container diagnostics. 
But the main thread has already shut down the state store, hence the DB Closed exception.
I think this issue occurred because the NM JVM did not exit on time, which allowed 
the state store event to be processed. After YARN-3585, I think this should be OK.
[~bibinchundatt] Can you run a regression test for it, please?

> Race condition when the NodeManager is shutting down and container is launched
> ------------------------------------------------------------------------------
>
>                 Key: YARN-3754
>                 URL: https://issues.apache.org/jira/browse/YARN-3754
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>         Environment: Suse 11 Sp3
>            Reporter: Bibin A Chundatt
>            Assignee: Sunil G
>            Priority: Critical
>         Attachments: NM.log
>
>
> A container is launched and returned to ContainerImpl.
> The NodeManager has closed the DB connection by then, which results in 
> {{org.iq80.leveldb.DBException: Closed}}. 
> *Attaching the exception trace*
> {code}
> 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_000002
> java.io.IOException: org.iq80.leveldb.DBException: Closed
>         at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.iq80.leveldb.DBException: Closed
>         at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
>         at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
>         at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
>         ... 15 more
> {code}
> We can add a check for whether the DB is closed when we move the container 
> out of the ACQUIRED state.
> As per the discussion in YARN-3585, I have added the same.
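
The check proposed in the description could look roughly like the sketch below. It is only an illustration of guarding a write behind a "closed" flag: SketchStateStore and all of its methods are hypothetical stand-ins, not the real NMLeveldbStateStoreService, and an in-memory map takes the place of LevelDB. The flag and the write share one lock so that a close() from the shutdown thread cannot slip in between the check and the put.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a DB-backed state store (assumption for
// illustration only; this is not the NodeManager's actual class).
class SketchStateStore {
    private final Map<String, String> db = new HashMap<>();
    private boolean closed = false;

    // Stores the diagnostics unless the store has already been closed.
    // Returns true if the value was written, false if the write was skipped.
    synchronized boolean storeContainerDiagnostics(String containerId,
                                                   String diagnostics) {
        if (closed) {
            // Shutdown has already closed the store; skip the write
            // instead of letting the DB layer throw.
            return false;
        }
        db.put(containerId, diagnostics);
        return true;
    }

    // Called from the shutdown path; later writes become no-ops.
    synchronized void close() {
        closed = true;
    }

    synchronized boolean isClosed() {
        return closed;
    }

    // Read back a stored value (null if nothing was stored).
    synchronized String get(String containerId) {
        return db.get(containerId);
    }
}
```

A caller such as a container state transition would then treat a false return as "NM is shutting down" and carry on without logging an exception. Whether the real fix belongs in the state store or in the transition is a design choice this sketch does not settle.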



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
