[ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jian He updated YARN-864:
-------------------------
    Attachment: YARN-864.1.patch

Patch for the NM to clean up containers on SHUTDOWN and REBOOT events.

> YARN NM leaking containers with CGroups
> ---------------------------------------
>
>                 Key: YARN-864
>                 URL: https://issues.apache.org/jira/browse/YARN-864
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.0.5-alpha
>         Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and YARN-600.
>            Reporter: Chris Riccomini
>         Attachments: rm-log, YARN-864.1.patch
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm
> seeing containers getting leaked by the NMs. I'm not quite sure what's going
> on -- has anyone seen this before? I'm concerned that maybe it's a
> misunderstanding on my part about how YARN's lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1371141151815_0008_03_000002 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container container_1371141151815_0008_03_000002 was assigned task ID 0. Requesting a new container for the task.
> The AM has been running steadily the whole time.
> Here's what the NM logs say:
> {noformat}
> 05:34:59,783 WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1143)
>         at java.lang.Thread.join(Thread.java:1196)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
>         at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
>         at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
>         at java.lang.Thread.run(Thread.java:619)
> 05:35:00,314 WARN ContainersMonitorImpl:463 - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
> 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
> 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_000002
> 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
>         at org.apache.hadoop.util.Shell.run(Shell.java:129)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
>         at org.apache.hadoop.util.Shell.run(Shell.java:129)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
> And, if I look on the machine that's running
> container_1371141151815_0008_03_000002, I see:
> {noformat}
> $ ps -ef | grep container_1371141151815_0008_03_000002
> criccomi 5365 27915 38 Jun18 ? 21:35:05 /export/apps/jdk/JDK-1_6_0_21/bin/java -cp /path-to-yarn-data-dir/usercache/criccomi/appcache/application_1371141151815_0008/container_1371141151815_0008_03_000002/...
> {noformat}
> The same holds true for container_1371141151815_0006_01_001598. When I look
> in the container logs, it's just happily running. No kill signal appears to
> be sent, and no error appears.
> Lastly, the RM logs show no major events around the time of the leak
> (5:35am). I am able to reproduce this simply by waiting about 12 hours, or
> so, and it seems to have started happening after I switched over to CGroups
> and LCE, and turned on stateful RM (using file system).
> Any ideas what's going on?
> Thanks!
> Chris

--
This message is automatically generated by JIRA.
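
One way to check whether the "Unable to delete cgroup" warnings in the NM log line up with the leaked processes: in cgroup v1, a cgroup directory cannot be removed while its tasks file still lists PIDs. The Python sketch below is not part of YARN-864.1.patch; it only scans a hierarchy such as /cgroup/cpu/hadoop-yarn for container cgroups that still have tasks attached. The function names, the temporary demo directory, and the container_hypothetical_idle entry are assumptions made for illustration; the container ID and PID 5365 are taken from the report.

```python
import tempfile
from pathlib import Path


def leaked_cgroup_tasks(cgroup_root):
    """Return {container cgroup name: [PIDs]} for cgroups whose tasks file is non-empty."""
    leaked = {}
    for entry in sorted(Path(cgroup_root).glob("container_*")):
        tasks_file = entry / "tasks"
        if not tasks_file.is_file():
            continue
        # The cgroup v1 tasks file holds one PID per line.
        pids = [int(tok) for tok in tasks_file.read_text().split()]
        if pids:
            leaked[entry.name] = pids
    return leaked


def build_demo_hierarchy(base):
    """Create a fake hierarchy (hypothetical layout) so the sketch runs without cgroups."""
    root = Path(base) / "hadoop-yarn"
    busy = root / "container_1371141151815_0008_03_000002"
    idle = root / "container_hypothetical_idle"  # made-up name, not from the report
    for directory in (busy, idle):
        directory.mkdir(parents=True)
    (busy / "tasks").write_text("5365\n")  # PID shown in the ps output
    (idle / "tasks").write_text("")        # empty: this cgroup could be removed
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_root = build_demo_hierarchy(tmp)
        for name, pids in leaked_cgroup_tasks(demo_root).items():
            print(f"{name}: still has tasks {pids}")
```

On an affected NM, the real check would be leaked_cgroup_tasks("/cgroup/cpu/hadoop-yarn"), assuming the mount point shown in the log; any container listed there after the NM has stopped still has a live process.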