[ https://issues.apache.org/jira/browse/YARN-8508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558847#comment-16558847 ]
genericqa commented on YARN-8508:
---------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 5s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 45s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 48s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 0s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 76m 41s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8508 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12933248/YARN-8505.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 95544b2179d7 3.13.0-139-generic #188-Ubuntu SMP Tue Jan 9 14:43:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / be150a1 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21380/testReport/ |
| Max. process+thread count | 336 (vs. ulimit of 10000) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21380/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.

> GPU does not get released even though the container is killed
> --------------------------------------------------------------
>
> Key: YARN-8508
> URL: https://issues.apache.org/jira/browse/YARN-8508
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Sumana Sathish
> Assignee: Chandni Singh
> Priority: Major
> Attachments: YARN-8505.001.patch, YARN-8505.002.patch
>
>
> GPU failed to release even though the container using it is being killed
> {Code}
> 2018-07-06 05:22:26,201 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000001 transitioned from RUNNING to KILLING
> 2018-07-06 05:22:26,250 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000002 transitioned from RUNNING to KILLING
> 2018-07-06 05:22:26,251 INFO application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_1530854311763_0006 transitioned from RUNNING to FINISHING_CONTAINERS_WAIT
> 2018-07-06 05:22:26,251 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container container_e20_1530854311763_0006_01_000002
> 2018-07-06 05:22:31,358 INFO launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1102)) - Could not get pid for container_e20_1530854311763_0006_01_000002. Waited for 5000 ms.
> 2018-07-06 05:22:31,358 WARN launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(784)) - Container clean up before pid file created container_e20_1530854311763_0006_01_000002
> 2018-07-06 05:22:31,359 INFO launcher.ContainerLaunch (ContainerLaunch.java:reapDockerContainerNoPid(940)) - Unable to obtain pid, but docker container request detected. Attempting to reap container container_e20_1530854311763_0006_01_000002
> 2018-07-06 05:22:31,494 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_000002/launch_container.sh
> 2018-07-06 05:22:31,500 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(828)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1530854311763_0006/container_e20_1530854311763_0006_01_000002/container_tokens
> 2018-07-06 05:22:31,510 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000001 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> 2018-07-06 05:22:31,510 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000002 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
> 2018-07-06 05:22:31,512 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2018-07-06 05:22:31,513 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0006_01_000002 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
> 2018-07-06 05:22:38,955 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e20_1530854311763_0007_01_000002 transitioned from NEW to SCHEDULED
> {Code}
> New container requesting for GPU fails to launch
> {code}
> 2018-07-06 05:22:39,048 ERROR nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleLaunchForLaunchType(550)) - ResourceHandlerChain.preStart() failed!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Failed to find enough GPUs, requestor=container_e20_1530854311763_0007_01_000002, #RequestedGPUs=2, #availableGpus=1
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:509)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:494)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:306)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-07-06 05:22:39,049 WARN launcher.ContainerLaunch (ContainerLaunch.java:call(331)) - Failed to launch container.
> java.io.IOException: ResourceHandlerChain.preStart() failed!
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:551)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:494)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:306)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:103)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Failed to find enough GPUs, requestor=container_e20_1530854311763_0007_01_000002, #RequestedGPUs=2, #availableGpus=1
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.internalAssignGpus(GpuResourceAllocator.java:225)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceAllocator.assignGpus(GpuResourceAllocator.java:173)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.gpu.GpuResourceHandlerImpl.preStart(GpuResourceHandlerImpl.java:98)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.preStart(ResourceHandlerChain.java:75)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:509)
>     ... 8 more
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org