[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607373#comment-16607373 ]
Eric Yang commented on YARN-8751: --------------------------------- +1 LGTM. > Container-executor permission check errors cause the NM to be marked unhealthy > ------------------------------------------------------------------------------ > > Key: YARN-8751 > URL: https://issues.apache.org/jira/browse/YARN-8751 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Shane Kumpf > Assignee: Craig Condit > Priority: Critical > Labels: Docker > Attachments: YARN-8751.001.patch > > > {{ContainerLaunch}} (and {{ContainerRelaunch}}) contains logic to mark a > NodeManager as UNHEALTHY if a {{ConfigurationException}} is thrown by > {{ContainerLaunch#launchContainer}} (or relaunchContainer). The exception > occurs based on the exit code returned by container-executor, and 7 different > exit codes cause the NM to be marked UNHEALTHY. > {code:java} > if (exitCode == > ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() || > exitCode == > ExitCode.INVALID_CONFIG_FILE.getExitCode() || > exitCode == > ExitCode.COULD_NOT_CREATE_SCRIPT_COPY.getExitCode() || > exitCode == > ExitCode.COULD_NOT_CREATE_CREDENTIALS_FILE.getExitCode() || > exitCode == > ExitCode.COULD_NOT_CREATE_WORK_DIRECTORIES.getExitCode() || > exitCode == > ExitCode.COULD_NOT_CREATE_APP_LOG_DIRECTORIES.getExitCode() || > exitCode == > ExitCode.COULD_NOT_CREATE_TMP_DIRECTORIES.getExitCode()) { > throw new ConfigurationException( > "Linux Container Executor reached unrecoverable exception", e);{code} > I can understand why these are treated as fatal with the existing process > container model. However, with privileged Docker containers this may be too > harsh, as Privileged Docker containers don't guarantee the user's identity > will be propagated into the container, so these mismatches can occur. Outside > of privileged containers, an application may inadvertently change the > permissions on one of these directories, triggering this condition. > In our case, a container changed the "appcache/<appid>/<containerid>" > directory permissions to 774. Some time later, the process in the container > died and the Retry Policy kicked in to RELAUNCH the container. When the > RELAUNCH occurred, container-executor checked the permissions of the > "appcache/<appid>/<containerid>" directory (the existing workdir is retained > for RELAUNCH) and returned exit code 35. Exit code 35 is > COULD_NOT_CREATE_WORK_DIRECTORIES, which is a fatal error. This killed all > containers running on that node, when really only this container would have > been impacted. > {code:java} > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Exception from container-launch. > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Container id: > container_e15_1535130383425_0085_01_000005 > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Exit code: 35 > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Exception message: Relaunch > container failed > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Shell error output: Could not > create container dirsCould not create local files and directories 5 6 > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Shell output: main : command > provided 4 > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - main : run as user is user > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - main : requested yarn user is yarn > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Creating script paths... > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Creating local dirs... > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Path > /grid/0/hadoop/yarn/local/usercache/user/appcache/application_1535130383425_0085/container_e15_1535130383425_0085_01_000005 > has permission 774 but needs per > mission 750. > 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor > (ContainerExecutor.java:logOutput(541)) - Wrote the exit code 35 to (null) > 2018-08-31 21:07:22,386 ERROR launcher.ContainerRelaunch > (ContainerRelaunch.java:call(129)) - Failed to launch container due to > configuration error. > org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container > Executor reached unrecoverable exception > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:633) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: > org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: > Relaunch container failed > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:987) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562) > ... 8 more > {code} > The root of the issue could be considered the fact that we can't guarantee > which user is running in the container, and should eliminate writable mounts > in this scenario. However, marking the NM unhealthy in all these cases does > seem overkill. > Opening this to discuss how we want to address this issue. [~jlowe] > [~ebadger] [~Jim_Brennan] [~eyang] [~billie.rinaldi] [~ccondit-target] let me > know your thoughts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org