[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589788#comment-14589788 ] Sunil G commented on YARN-3754: --- Hi [~bibinchundatt] If this issue is not reproducible as per latest trunk, cud u please mark this issue as closed. Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Sunil G Priority: Critical Attachments: NM.log Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} we can add a check whether DB is closed while we move container from ACQUIRED state. As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14574305#comment-14574305 ] Bibin A Chundatt commented on YARN-3754: [~rohithsharma] and [~sunilg] Have tried with build containing YARN-3585 and YARN-3641. org.iq80.leveldb.DBException: Closed. exception i am not able to reproduce . Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Sunil G Priority: Critical Attachments: NM.log Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} we can add a check whether DB is closed while we move container from ACQUIRED state. As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572244#comment-14572244 ] Rohith commented on YARN-3754: -- bq. When NM is shutting down, ContainerLaunch is also interrupted. During this interrupted exception handling, NM tries to update container diagnostics. But from main thread statestore is down ,hence caused the DB Close exception. I think this issue caused since NM jvm did not exit on_time which allowed to process the statestore event. After YARN-3585 , I think this should be OK. [~bibinchundatt] Can you regression it pls Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Sunil G Priority: Critical Attachments: NM.log Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} we can add a check whether DB is closed while we move container from ACQUIRED state. As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570790#comment-14570790 ] Sunil G commented on YARN-3754: --- I have got the logs from [~bibinchundatt] offline. {noformat} 2015-05-30 01:11:16,179 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_e313_1432908361253_4506_01_01 and exit code: 0 java.io.IOException: java.lang.InterruptedException ... ... 2015-05-30 01:11:16,179 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update diagnostics in state store for container_e313_1432908361253_4506_01_01 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostic {noformat} When NM is shutting down, ContainerLaunch is also interrupted. During this interrupted exception handling, NM tries to update container diagnostics. But from main thread statestore is down ,hence caused the DB Close exception. This scenario is handled in YARN-3641 already by [~djp] . [~bibinchundatt] could you please update this patch and check this and we can close this ticket as duplicate. Attaching NM logs too. Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Sunil G Priority: Critical Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} we can add a check whether DB is closed while we move container from ACQUIRED state. As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14569381#comment-14569381 ] Sunil G commented on YARN-3754: --- [~bibinchundatt] Could u also please attach NM logs here. Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Sunil G Priority: Critical Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} we can add a check whether DB is closed while we move container from ACQUIRED state. As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
[ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14568508#comment-14568508 ] Sunil G commented on YARN-3754: --- I would like to work on this Jira. Please reassign otherwise. Thank you. Race condition when the NodeManager is shutting down and container is launched -- Key: YARN-3754 URL: https://issues.apache.org/jira/browse/YARN-3754 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Container is launched and returned to ContainerImpl NodeManager closed the DB connection which resulting in {{org.iq80.leveldb.DBException: Closed}}. *Attaching the exception trace* {code} 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_02 java.io.IOException: org.iq80.leveldb.DBException: Closed at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101) at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129) at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.iq80.leveldb.DBException: Closed at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123) at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259) ... 15 more {code} we can add a check whether DB is closed while we move container from ACQUIRED state. As per the discussion in YARN-3585 have add the same -- This message was sent by Atlassian JIRA (v6.3.4#6332)