[ https://issues.apache.org/jira/browse/YARN-8648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599213#comment-16599213 ]
Jim Brennan commented on YARN-8648:
-----------------------------------

[~jlowe] thanks for the review!

{quote}Why was the postComplete call moved in reapContainer to before the container is removed via docker? Shouldn't docker first remove its cgroups for the container before we remove ours?{quote}

I was trying to preserve the order of operations. Normally postComplete is called immediately after launchContainer() returns, and then reapContainer() is called as part of cleanupContainer() processing, so the resource handlers usually get a chance to clean up cgroups before we clean up the container. If we do the docker cleanup first, it will delete the cgroups before the resource handler postComplete processing runs - it doesn't know which cgroups are handled by resource handlers, so it just deletes them all. Since both are really just deleting the cgroups, I don't think the order matters that much, so I will move it back if you think that is better.

{quote}Is there a reason to separate removing docker cgroups from removing the docker container? This seems like a natural extension to cleaning up after a container run by docker, and that's already covered by the reap command. The patch would remain a docker-only change but without needing to modify the container-executor interface.{quote}

It is currently done as part of the reap processing, but as a separate privileged operation. We could certainly fold this into the remove-docker-container processing in container-executor, but that would require adding the yarn-hierarchy as an additional argument to DockerRmCommand. It would also require changing DockerContainerDeletionTask to store the yarn-hierarchy String along with the ContainerId. Despite the additional container-executor interface, I think the current approach is less code and simpler, but I'm definitely willing to rework it if you think that is a better solution.
{quote}Nit: PROC_MOUNT_PATH should be a macro (i.e.: #define) or lower-cased. Similar for CGROUP_MOUNT.{quote}

I will fix these.

{quote}The snprintf result should be checked for truncation in addition to output errors (i.e.: result >= PATH_MAX means it was truncated) otherwise we formulate an incomplete path targeted for deletion if that somehow occurs. Alternatively the code could use make_string or asprintf to allocate an appropriately sized buffer for each entry rather than trying to reuse a manually sized buffer.{quote}

I will fix this. I forgot about make_string().

{quote}Is there any point in logging to the error file that a path we want to delete has already been deleted? This seems like it will just be noise, especially if systemd or something else is periodically cleaning some of these empty cgroups.{quote}

I'll remove it - it was nice while debugging, but it's not needed.

{quote}Related to the previous comment, the rmdir result should be checked for ENOENT and treat that as success.{quote}

I explicitly check that the directory exists before calling rmdir, so I'm not sure this is necessary, but I can add it anyway.

{quote}Nit: I think lineptr should be freed in the cleanup label in case someone later adds a fatal error that jumps to cleanup.{quote}

Will do. Thanks again for the review!

> Container cgroups are leaked when using docker
> ----------------------------------------------
>
> Key: YARN-8648
> URL: https://issues.apache.org/jira/browse/YARN-8648
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
> Labels: Docker
> Attachments: YARN-8648.001.patch
>
>
> When you run with docker and enable cgroups for cpu, docker creates cgroups
> for all resources on the system, not just for cpu. For instance, if
> {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}},
> the nodemanager will create a cgroup for each container under
> {{/sys/fs/cgroup/cpu/hadoop-yarn}}.
> In the docker case, we pass this path
> via the {{--cgroup-parent}} command line argument. Docker then creates a
> cgroup for the docker container under that, for instance:
> {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.
> When the container exits, docker cleans up the {{docker_container_id}}
> cgroup, and the nodemanager cleans up the {{container_id}} cgroup. All is
> good under {{/sys/fs/cgroup/cpu/hadoop-yarn}}.
>
> The problem is that docker also creates that same hierarchy under every
> resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these
> are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio,
> perf_event, and systemd. So, for instance, docker creates
> {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but
> it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up
> the {{container_id}} cgroups for these other resources. On one of our busy
> clusters, we found more than 100,000 of these leaked cgroups.
>
> I found this in our 2.8-based version of hadoop, but I have been able to
> reproduce it with current hadoop.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)