[ https://issues.apache.org/jira/browse/YARN-9518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833208#comment-16833208 ]
Shurong Mai commented on YARN-9518:
-----------------------------------

[~jhung], YARN-2194 looks like the same problem as this issue, but it proposes a different solution.

> can't use CGroups with YARN in centos7
> ---------------------------------------
>
> Key: YARN-9518
> URL: https://issues.apache.org/jira/browse/YARN-9518
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.2.0, 2.9.2, 2.8.5, 2.7.7, 3.1.2
> Reporter: Shurong Mai
> Priority: Major
> Labels: cgroup, patch
> Attachments: YARN-9518.patch
>
>
> The OS version is CentOS 7.
> {code:java}
> cat /etc/redhat-release
> CentOS Linux release 7.3.1611 (Core)
> {code}
> After I set the cgroup configuration variables for YARN, the nodemanager started without any problem. But when I ran a job, the job failed with the exceptional nodemanager logs shown at the end.
> In these logs, the important line is "Can't open file /sys/fs/cgroup/cpu as node manager - Is a directory".
> After analysis, I found the reason. In CentOS 6, the cgroup "cpu" and "cpuacct" subsystems are as follows:
> {code:java}
> /sys/fs/cgroup/cpu
> /sys/fs/cgroup/cpuacct
> {code}
> But in CentOS 7, they are as follows:
> {code:java}
> /sys/fs/cgroup/cpu -> cpu,cpuacct
> /sys/fs/cgroup/cpuacct -> cpu,cpuacct
> /sys/fs/cgroup/cpu,cpuacct{code}
> "cpu" and "cpuacct" have been merged as "cpu,cpuacct"; "cpu" and "cpuacct" are symbolic links.
> Looking at the source code, the nodemanager gets the cgroup subsystem info by reading /proc/mounts, so the path it gets for both the cpu and cpuacct subsystems is "/sys/fs/cgroup/cpu,cpuacct".
> The resource description argument passed to container-executor is as follows:
> {code:java}
> cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_000001/tasks
> {code}
> There is a comma in the cgroup path, but the comma is also the separator between multiple resources.
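
The truncation described above can be reproduced in isolation. The following is a minimal sketch, assuming only that the resource argument is split on commas as the issue describes. `CgroupArgSplit` and `splitResources` are hypothetical names used for illustration; they are not part of container-executor, which is written in C.

```java
import java.util.ArrayList;
import java.util.List;

class CgroupArgSplit {
    // Hypothetical stand-in for comma-separated resource parsing:
    // strips the "cgroups=" prefix and splits the rest on ','.
    static List<String> splitResources(String arg) {
        String prefix = "cgroups=";
        String value = arg.startsWith(prefix) ? arg.substring(prefix.length()) : arg;
        List<String> parts = new ArrayList<>();
        for (String p : value.split(",")) {
            if (!p.isEmpty()) {
                parts.add(p);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Path taken from the merged "cpu,cpuacct" mount point (contains a comma).
        String merged = "cgroups=/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/"
                + "container_1554210318404_0057_02_000001/tasks";
        // Path taken through the "cpu" symbolic link (no comma).
        String alias = "cgroups=/sys/fs/cgroup/cpu/hadoop-yarn/"
                + "container_1554210318404_0057_02_000001/tasks";

        System.out.println(splitResources(merged));
        System.out.println(splitResources(alias));
    }
}
```

With the merged path, the first element after splitting is "/sys/fs/cgroup/cpu", which is a directory and matches the path named in the error message. With the comma-free path, the argument stays in one piece.
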
> Therefore, container-executor truncates the cgroup path to "/sys/fs/cgroup/cpu" rather than the correct cgroup path "/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn/container_1554210318404_0057_02_000001/tasks" and reports the error "Can't open file /sys/fs/cgroup/cpu as node manager - Is a directory" in the log.
> Hence I modified the source code and submitted a patch. The idea of the patch is that the nodemanager gets the cgroup cpu path as "/sys/fs/cgroup/cpu" rather than "/sys/fs/cgroup/cpu,cpuacct". As a result, the resource description argument passed to container-executor is as follows:
> {code:java}
> cgroups=/sys/fs/cgroup/cpu/hadoop-yarn/container_1554210318404_0057_02_000001/tasks
> {code}
> Note that there is no comma in the path, and it is a valid path because "/sys/fs/cgroup/cpu" is a symbolic link to "/sys/fs/cgroup/cpu,cpuacct".
> After applying the patch, the problem is resolved and the job runs successfully.
> The patch is compatible with the cgroup paths of both older and current OS versions, such as CentOS 6 and CentOS 7, and applies generally to other cgroup subsystem paths, such as the network subsystem:
> {code:java}
> /sys/fs/cgroup/net_cls -> net_cls,net_prio
> /sys/fs/cgroup/net_prio -> net_cls,net_prio
> /sys/fs/cgroup/net_cls,net_prio{code}
>
>
> ##################################################################################################################################
> {panel:title=exceptional nodemanager logs:}
> 2019-04-19 20:17:20,095 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554210318404_0042_01_000001 transitioned from LOCALIZED to RUNNING
> 2019-04-19 20:17:20,101 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exit code from container container_1554210318404_0042_01_000001 is : 27
> 2019-04-19 20:17:20,103 WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_1554210318404_0042_01_000001 and exit code: 27
> ExitCodeException exitCode=27:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
>     at org.apache.hadoop.util.Shell.run(Shell.java:482)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:299)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
> 2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1554210318404_0042_01_000001
> 2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 27
> 2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=27:
> 2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
> 2019-04-19 20:17:20,108 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.run(Shell.java:482)
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:299)
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.lang.Thread.run(Thread.java:745)
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell output: main : command provided 1
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : user is test_hadoop
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: main : requested yarn user is datadev
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Writing to cgroup task files...
> 2019-04-19 20:17:20,109 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Can't open file /sys/fs/cgroup/cpu as node manager - Is a directory
> 2019-04-19 20:17:20,131 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 27
> 2019-04-19 20:17:20,133 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1554210318404_0042_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
> 2019-04-19 20:17:20,133 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1554210318404_0042_01_000001
>
> {panel}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org