[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824644#comment-16824644 ]
Eric Yang edited comment on YARN-9486 at 4/23/19 11:19 PM:
-----------------------------------------------------------

[~Jim_Brennan] I added a couple of debug statements:
{code:java}
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerCleanup.java
@@ -96,9 +96,10 @@ public void run() {
     }
     // launch flag will be set to true if process already launched
     boolean alreadyLaunched = !launch.markLaunched();
+    LOG.info("alreadyLaunched: "+alreadyLaunched+" isLaunchCompleted: "+launch.isLaunchCompleted());
     if (!alreadyLaunched) {
+      LOG.info("!alreadyLaunched: "+!alreadyLaunched);
       LOG.info("Container " + containerIdStr + " not launched."
           + " No cleanup needed to be done");
       return;
{code}
The node manager log output looks like this:
{code:java}
2019-04-23 22:34:08,919 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch: Failed to relaunch container.
java.io.IOException: Could not find nmPrivate/application_1556058714621_0001/container_1556058714621_0001_01_000002//container_1556058714621_0001_01_000002.pid in any of the directories at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getPathToRead(LocalDirsHandlerService.java:597) at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForRead(LocalDirsHandlerService.java:612) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.getPidFilePath(ContainerRelaunch.java:200) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:90) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2019-04-23 22:34:08,922 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1556058714621_0001_01_000002 transitioned from RELAUNCHING to EXITED_WITH_FAILURE 2019-04-23 22:34:08,925 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Cleaning up container container_1556058714621_0001_01_000002 2019-04-23 22:34:08,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: alreadyLaunched: false isLaunchCompleted: true 2019-04-23 22:34:08,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: !alreadyLaunched: true 2019-04-23 22:34:08,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Container container_1556058714621_0001_01_000002 not launched. 
No cleanup needed to be done
2019-04-23 22:34:08,963 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /tmp/hadoop-yarn/nm-local-dir/usercache/hbase/appcache/application_1556058714621_0001/container_1556058714621_0001_01_000002
2019-04-23 22:34:08,963 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Privileged Execution Command Array: [/usr/local/hadoop-3.3.0-SNAPSHOT/bin/container-executor, hbase, hbase, 3, /tmp/hadoop-yarn/nm-local-dir/usercache/hbase/appcache/application_1556058714621_0001/container_1556058714621_0001_01_000002]
2019-04-23 22:34:08,963 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1556058714621_0001 CONTAINERID=container_1556058714621_0001_01_000002
2019-04-23 22:34:08,967 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1556058714621_0001_01_000002 transitioned from EXITED_WITH_FAILURE to DONE
{code}
When the container is set to relaunch, markLaunched() returns true, because ContainerRelaunch starts with its launched flag reset to "not launched yet". The atomic boolean compares the expected value false against the current value false, so the call returns true. Negating that result twice still yields true, so the "if (!alreadyLaunched)" branch is taken and the method returns early, as the log above shows (alreadyLaunched: false, !alreadyLaunched: true). This is why the previous instance of the container is never cleaned up. I think the added logic is necessary: by checking whether the container launch had completed, it ensures that a relaunch still proceeds with the Docker container cleanup logic. Do you agree with this analysis?
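To make the flag behaviour concrete, here is a minimal, self-contained sketch. The Launch class is a hypothetical stand-in, not the real NodeManager launch class; it only assumes that markLaunched() is a compare-and-set from false to true on an AtomicBoolean and that isLaunchCompleted() reads a second flag. The combined condition in the second helper is my assumption about the shape of the extra check, not the exact text of the patch.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class LaunchFlagSketch {

    // Hypothetical stand-in for the launch object discussed above.
    static class Launch {
        private final AtomicBoolean containerAlreadyLaunched = new AtomicBoolean(false);
        private final AtomicBoolean completed = new AtomicBoolean(false);

        // Returns true only for the caller that flips the flag from false to true.
        boolean markLaunched() {
            return containerAlreadyLaunched.compareAndSet(false, true);
        }

        boolean isLaunchCompleted() {
            return completed.get();
        }

        // Hypothetical helper to simulate a launch attempt that has finished.
        void markCompleted() {
            completed.set(true);
        }
    }

    // Existing check: skip cleanup whenever this call was the first to set the flag.
    static boolean skipsCleanupCurrent(Launch launch) {
        boolean alreadyLaunched = !launch.markLaunched();
        return !alreadyLaunched;
    }

    // Assumed shape of the extra check: skip only if the launch also never completed.
    static boolean skipsCleanupWithCompletedCheck(Launch launch) {
        boolean alreadyLaunched = !launch.markLaunched();
        return !alreadyLaunched && !launch.isLaunchCompleted();
    }

    public static void main(String[] args) {
        // A relaunch object: launched flag still false, but the attempt has completed.
        Launch first = new Launch();
        first.markCompleted();
        System.out.println("current check skips cleanup: " + skipsCleanupCurrent(first));

        Launch second = new Launch();
        second.markCompleted();
        System.out.println("with completed check skips cleanup: "
            + skipsCleanupWithCompletedCheck(second));
    }
}
```

In this sketch the first helper reproduces the early return seen in the log (alreadyLaunched: false, isLaunchCompleted: true), while the second lets cleanup proceed for the same state.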
was (Author: eyang): [~Jim_Brennan] I added a couple debug statement: {code:java} +++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerCleanup.java @@ -96,9 +96,10 @@ public void run() { } // launch flag will be set to true if process already launched boolean alreadyLaunched = !launch.markLaunched(); + LOG.info("alreadyLaunched: "+alreadyLaunched+" isLaunchCompleted: "+launch.isLaunchCompleted()); if (!alreadyLaunched) { + LOG.info("!alreadyLaunched: "+!alreadyLaunched); LOG.info("Container " + containerIdStr + " not launched." + " No cleanup needed to be done"); return; {code} Output of the logs for node manager looks like this: {code:java} 2019-04-23 22:34:08,919 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch: Failed to relaunch container. java.io.IOException: Could not find nmPrivate/application_1556058714621_0001/container_1556058714621_0001_01_000002//container_1556058714621_0001_01_000002.pid in any of the directories at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getPathToRead(LocalDirsHandlerService.java:597) at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForRead(LocalDirsHandlerService.java:612) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.getPidFilePath(ContainerRelaunch.java:200) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:90) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 2019-04-23 
22:34:08,922 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1556058714621_0001_01_000002 transitioned from RELAUNCHING to EXITED_WITH_FAILURE 2019-04-23 22:34:08,925 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Cleaning up container container_1556058714621_0001_01_000002 2019-04-23 22:34:08,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: alreadyLaunched: false isLaunchCompleted: true 2019-04-23 22:34:08,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: !alreadyLaunched: true 2019-04-23 22:34:08,926 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: Container container_1556058714621_0001_01_000002 not launched. No cleanup needed to be done 2019-04-23 22:34:08,963 INFO org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Deleting absolute path : /tmp/hadoop-yarn/nm-local-dir/usercache/hbase/appcache/application_1556058714621_0001/container_1556058714621_0001_01_000002 2019-04-23 22:34:08,963 DEBUG org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Privileged Execution Command Array: [/usr/local/hadoop-3.3.0-SNAPSHOT/bin/container-executor, hbase, hbase, 3, /tmp/hadoop-yarn/nm-local-dir/usercache/hbase/appcache/application_1556058714621_0001/container_1556058714621_0001_01_000002] 2019-04-23 22:34:08,963 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1556058714621_0001 CONTAINERID=container_1556058714621_0001_01_000002 2019-04-23 22:34:08,967 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1556058714621_0001_01_000002 transitioned 
from EXITED_WITH_FAILURE to DONE {code} If it is set to relaunch, the markedLaunched will return false because it was previously marked by prepareForLaunch and launched. This atomic boolean compare false to true, will returned false. Double logic negating for false is still false. This causes the failure to clean up the previous instance of the container. I think the added logic is necessary to ensure relaunch will proceed with clean up Docker container instance logic by checking if container had been completed. Do you agree with this analysis? > Docker container exited with failure does not get clean up correctly > -------------------------------------------------------------------- > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task > Affects Versions: 3.2.0 > Reporter: Eric Yang > Assignee: Eric Yang > Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_000007//container_1555111445937_0008_01_000007.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_000007 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_000007 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_000007 not launched. 
No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_000007 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_000007 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_000007 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_000007 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_000007 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_000007 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_000007 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_000007, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: <memory:1024, vCores:1>, Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > 
org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: [container_1555111445937_0008_01_000007] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_000007 > {code} > There is no docker rm command performed.