[ https://issues.apache.org/jira/browse/YARN-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219536#comment-16219536 ]
Eric Badger commented on YARN-7395:
-----------------------------------

Here are the relevant lines from the NM log:
{noformat}
2017-10-25 20:03:07,549 [Container Monitor] WARN monitor.ContainersMonitorImpl: Process tree for container: container_e126_1508911755032_0004_02_000001 has processes older than 1 iteration running over the configured limit. Limit=536870912, current usage = 585281536
2017-10-25 20:03:07,551 [Container Monitor] WARN monitor.ContainersMonitorImpl: Container [pid=29030,containerID=container_e126_1508911755032_0004_02_000001] is running beyond physical memory limits. Current usage: 558.2 MB of 512 MB physical memory used; 2.8 GB of 1.0 GB virtual memory used. Killing container.
Dump of the process-tree for container_e126_1508911755032_0004_02_000001 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 29065 29030 29030 29030 (java) 6022 290 2962636800 142606 /bin/java -Djava.io.tmpdir=/tmp/yarn-local/usercache/ebadger/appcache/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -XX:ErrorFile=/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/hs_err_pid%p.log -XX:GCTimeLimit=50 -XX:ParallelGCThreads=4 -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/gc.log -Xmx1024m -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true org.apache.hadoop.mapreduce.v2.app.MRAppMaster
    |- 29030 29014 29030 29030 (bash) 3 2 9474048 285 /bin/bash -c /bin/java -Djava.io.tmpdir=/tmp/yarn-local/usercache/ebadger/appcache/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -XX:ErrorFile=/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/hs_err_pid%p.log -XX:GCTimeLimit=50 -XX:ParallelGCThreads=4 -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/gc.log -Xmx1024m -XX:NewRatio=8 -Djava.net.preferIPv4Stack=true org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/stdout 2>/tmp/yarn-logs/application_1508911755032_0004/container_e126_1508911755032_0004_02_000001/stderr

2017-10-25 20:03:07,551 [Container Monitor] INFO monitor.ContainersMonitorImpl: Removed ProcessTree with root 29030
2017-10-25 20:03:07,551 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e126_1508911755032_0004_02_000001 transitioned from RUNNING to KILLING
2017-10-25 20:03:07,552 [AsyncDispatcher event handler] INFO launcher.ContainerLaunch: Cleaning up container container_e126_1508911755032_0004_02_000001
2017-10-25 20:03:07,576 [AsyncDispatcher event handler] WARN nodemanager.LinuxContainerExecutor: Error in signalling container 29030 with SIGTERM; exit = 1
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Signal container failed
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.signalContainer(DockerLinuxContainerRuntime.java:615)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:510)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:473)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:140)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:56)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
    at java.lang.Thread.run(Thread.java:745)
2017-10-25 20:03:07,576 [AsyncDispatcher event handler] INFO nodemanager.ContainerExecutor: Using command stop 'container_e126_1508911755032_0004_02_000001'
2017-10-25 20:03:07,576 [AsyncDispatcher event handler] WARN launcher.ContainerLaunch: Exception when trying to cleanup container container_e126_1508911755032_0004_02_000001: java.io.IOException: Problem signalling container 29030 with SIGTERM; output: Using command stop 'container_e126_1508911755032_0004_02_000001' and exitCode: 1
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:521)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java:473)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:140)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher.handle(ContainersLauncher.java:56)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Signal container failed
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.signalContainer(DockerLinuxContainerRuntime.java:615)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:510)
    ... 6 more
{noformat}

> NM fails to successfully kill tasks that run over their memory limit
> --------------------------------------------------------------------
>
>                 Key: YARN-7395
>                 URL: https://issues.apache.org/jira/browse/YARN-7395
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>            Reporter: Eric Badger
>
> The NM correctly notes that the container is over its configured limit, but
> then fails to successfully kill the process. So the Docker container AM stays
> around and the job keeps running.
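
The figures in the two WARN lines can be cross-checked against the process-tree dump. The sketch below sums the RSSMEM_USAGE(PAGES) and VMEM_USAGE(BYTES) columns and compares the result with the logged limit. It assumes a 4096-byte page size, which the log does not state; that value is chosen because it reproduces the logged "current usage" exactly. The names `PAGE_SIZE`, `processes`, and `limit_bytes` are illustrative and are not taken from the NM code.

```python
# Cross-check of the memory figures reported in the NM log.
# Assumption: 4096-byte pages (the log does not state the page size).
PAGE_SIZE = 4096

# (pid, vmem_bytes, rss_pages) copied from the process-tree dump.
processes = [
    (29065, 2962636800, 142606),  # java (MRAppMaster)
    (29030, 9474048, 285),        # bash wrapper
]

limit_bytes = 536870912  # "Limit=" value from the first WARN line

# Total resident memory: sum of RSS pages times the assumed page size.
rss_bytes = sum(pages for _, _, pages in processes) * PAGE_SIZE
# Total virtual memory: sum of the per-process VMEM byte counts.
vmem_bytes = sum(vmem for _, vmem, _ in processes)

print(rss_bytes)                     # 585281536, same as "current usage"
print(round(rss_bytes / 2**20, 1))   # 558.2 (MB)
print(round(vmem_bytes / 2**30, 1))  # 2.8 (GB)
print(limit_bytes // 2**20)          # 512 (MB)
print(rss_bytes > limit_bytes)       # True: over the physical memory limit
```

Under that assumption the monitor's numbers are self-consistent: the container is roughly 46 MB over its 512 MB physical limit, so the accounting is correct and the failure lies in the signalling step that follows.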