[ https://issues.apache.org/jira/browse/YARN-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer updated YARN-8382:
-----------------------------------
    Fix Version/s: 3.0.4

> cgroup file leak in NM
> ----------------------
>
>                 Key: YARN-8382
>                 URL: https://issues.apache.org/jira/browse/YARN-8382
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>         Environment: We wrote a container with a shutdownHook that contains a
> piece of code like "while(true) sleep(100)". When
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* <
> *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file leak happens;
> when *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* >
> *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file is deleted
> successfully.
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Major
>             Fix For: 3.2.0, 3.1.1, 3.0.4
>
>         Attachments: YARN-8382-branch-2.8.3.001.patch,
> YARN-8382-branch-2.8.3.002.patch, YARN-8382.001.patch, YARN-8382.002.patch
>
>
> As Jiandan said in YARN-6562, the NM may time out while deleting a container's
> cgroup file, with logs like the one below:
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler:
> Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to
> delete for 1000ms
>
> We found that one such situation is when
> *yarn.nodemanager.sleep-delay-before-sigkill.ms* is set larger than
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*; in that
> case the cgroup file leak happens.
>
> A container's process tree looks like the following:
> bash(16097)───java(16099)─┬─{java}(16100)
>                           ├─{java}(16101)
>                           ├─{java}(16102)
>
> {{When the NM kills a container, it sends kill -15 -pid to kill the container's
> process group. The bash process exits when it receives SIGTERM, but the java
> process may still do some work (a shutdownHook etc.) and does not exit until it
> receives SIGKILL. When the bash process exits, CgroupsLCEResourcesHandler begins
> trying to delete the cgroup files.
> So when
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*
> elapses, the java processes may still be running and cgroup/tasks is still not
> empty, which causes the cgroup file leak.}}
>
> {{We add a condition that
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must
> be bigger than *yarn.nodemanager.sleep-delay-before-sigkill.ms* to solve this
> problem.}}
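
The condition described in the issue can be illustrated with a small standalone check. This is a minimal sketch, not the code from the attached patches: the class CgroupTimeoutCheck, its methods, and the sample values are hypothetical, and a plain Map stands in for the NodeManager configuration. Only the two property names come from the report. Treating equal values as invalid is an assumption based on the "must be bigger than" wording.

```java
import java.util.Map;

/**
 * Hypothetical helper (not part of the YARN-8382 patches) that checks the
 * ordering of the two timeouts discussed in the issue.
 */
final class CgroupTimeoutCheck {
    // Property names as quoted in the issue description.
    static final String DELETE_TIMEOUT_KEY =
        "yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms";
    static final String SIGKILL_DELAY_KEY =
        "yarn.nodemanager.sleep-delay-before-sigkill.ms";

    private CgroupTimeoutCheck() {}

    /**
     * Returns true when cgroup deletion would give up no later than SIGKILL
     * is sent, so a process stuck in a shutdown hook can still be listed in
     * cgroup/tasks. The equal case is assumed to be unsafe.
     */
    static boolean isLeakLikely(long deleteTimeoutMs, long sigkillDelayMs) {
        return deleteTimeoutMs <= sigkillDelayMs;
    }

    /**
     * Rejects a configuration whose delete timeout is not strictly larger
     * than the SIGKILL delay. Both keys must be present in the map.
     */
    static void validate(Map<String, String> conf) {
        long deleteTimeoutMs = Long.parseLong(conf.get(DELETE_TIMEOUT_KEY));
        long sigkillDelayMs = Long.parseLong(conf.get(SIGKILL_DELAY_KEY));
        if (isLeakLikely(deleteTimeoutMs, sigkillDelayMs)) {
            throw new IllegalArgumentException(
                DELETE_TIMEOUT_KEY + " (" + deleteTimeoutMs + ") must be bigger than "
                    + SIGKILL_DELAY_KEY + " (" + sigkillDelayMs + ")");
        }
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from the issue.
        Map<String, String> conf = Map.of(
            DELETE_TIMEOUT_KEY, "6000",
            SIGKILL_DELAY_KEY, "5000");
        validate(conf); // passes: deletion keeps retrying until after SIGKILL
        System.out.println("configuration accepted");
    }
}
```

With the values swapped (a delete timeout of 5000 and a SIGKILL delay of 6000), validate throws, which corresponds to the leaking case in the Environment field.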