[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607817#comment-16607817 ]

Eric Yang commented on YARN-8751:
----------------------------------

[~ccondit-target] cherry-picked to branch-3.1.

> Container-executor permission check errors cause the NM to be marked unhealthy
> -------------------------------------------------------------------------------
>
>                 Key: YARN-8751
>                 URL: https://issues.apache.org/jira/browse/YARN-8751
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Shane Kumpf
>            Assignee: Craig Condit
>            Priority: Critical
>              Labels: Docker
>             Fix For: 3.2.0, 3.1.2
>
>         Attachments: YARN-8751.001.patch
>
>
> {{ContainerLaunch}} (and {{ContainerRelaunch}}) contains logic to mark a NodeManager as UNHEALTHY if a {{ConfigurationException}} is thrown by {{ContainerLaunch#launchContainer}} (or relaunchContainer). The exception occurs based on the exit code returned by container-executor, and 7 different exit codes cause the NM to be marked UNHEALTHY.
> {code:java}
> if (exitCode == ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() ||
>     exitCode == ExitCode.INVALID_CONFIG_FILE.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_SCRIPT_COPY.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_CREDENTIALS_FILE.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_WORK_DIRECTORIES.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_APP_LOG_DIRECTORIES.getExitCode() ||
>     exitCode == ExitCode.COULD_NOT_CREATE_TMP_DIRECTORIES.getExitCode()) {
>   throw new ConfigurationException(
>       "Linux Container Executor reached unrecoverable exception", e);
> {code}
> I can understand why these are treated as fatal with the existing process container model. However, with privileged Docker containers this may be too harsh, as privileged Docker containers don't guarantee the user's identity will be propagated into the container, so these mismatches can occur. Outside of privileged containers, an application may inadvertently change the permissions on one of these directories, triggering this condition.
> In our case, a container changed the "appcache//" directory permissions to 774. Some time later, the process in the container died and the Retry Policy kicked in to RELAUNCH the container. When the RELAUNCH occurred, container-executor checked the permissions of the "appcache//" directory (the existing workdir is retained for RELAUNCH) and returned exit code 35. Exit code 35 is COULD_NOT_CREATE_WORK_DIRECTORIES, which is a fatal error. This killed all containers running on that node, when really only this container would have been impacted.
> {code:java}
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Exception from container-launch.
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Container id: container_e15_1535130383425_0085_01_05
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Exit code: 35
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Exception message: Relaunch container failed
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Shell error output: Could not create container dirsCould not create local files and directories 5 6
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) -
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Shell output: main : command provided 4
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - main : run as user is user
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - main : requested yarn user is yarn
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Creating script paths...
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Creating local dirs...
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Path /grid/0/hadoop/yarn/local/usercache/user/appcache/application_1535130383425_0085/container_e15_1535130383425_0085_01_05 has permission 774 but needs permission 750.
> 2018-08-31 21:07:22,365 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Wrote the exit code 35 to (null)
> 2018-08-31
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607808#comment-16607808 ]

Hudson commented on YARN-8751:
------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14905 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14905/])
YARN-8751. Reduce conditions that mark node manager as unhealthy. (eyang: rev 7d623343879ce9a8f8e64601024d018efc02794c)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
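For reference, a minimal sketch of the shape of the committed change in {{LinuxContainerExecutor}}, following the consensus in the comments below. The surrounding error-handling context is an assumption for illustration, not the literal patch:

{code:java}
// Sketch only (assumed context): exit-code handling on the container launch
// error path. Only the two truly unrecoverable conditions stay fatal to the NM.
if (exitCode == ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() ||
    exitCode == ExitCode.INVALID_CONFIG_FILE.getExitCode()) {
  // A bad container-executor binary or config file is catastrophic: the NM
  // can no longer control or clean up any container, so the node is marked
  // unhealthy via ConfigurationException.
  throw new ConfigurationException(
      "Linux Container Executor reached unrecoverable exception", e);
}
// All other exit codes (COULD_NOT_CREATE_SCRIPT_COPY,
// COULD_NOT_CREATE_CREDENTIALS_FILE, COULD_NOT_CREATE_WORK_DIRECTORIES,
// COULD_NOT_CREATE_APP_LOG_DIRECTORIES, COULD_NOT_CREATE_TMP_DIRECTORIES)
// now fail only the single container launch instead of the whole NM.
{code}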
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607797#comment-16607797 ]

Craig Condit commented on YARN-8751:
-------------------------------------

[~eyang], [~shaneku...@gmail.com]: Do we want to commit this to branch-3.1 as well?
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607373#comment-16607373 ]

Eric Yang commented on YARN-8751:
----------------------------------

+1 LGTM.
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606524#comment-16606524 ]

Hadoop QA commented on YARN-8751:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 27m 15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 32s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 0s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 54s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 98m 21s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8751 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12938719/YARN-8751.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux f1b1de2e20ea 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 07:31:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / eca1a4b |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21780/testReport/ |
| Max. process+thread count | 407 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21780/console |
| Powered by | Apache Yetus 0.
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606270#comment-16606270 ]

Craig Condit commented on YARN-8751:
-------------------------------------

[~shaneku...@gmail.com], looks like we have consensus on the approach. I can take this one.
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606189#comment-16606189 ]

Shane Kumpf commented on YARN-8751:
------------------------------------

Thanks for the feedback and suggestions everyone. I think the issue is most likely to happen under relaunch conditions with a poorly behaving container (as noted by [~eyang]). Relaunch (afaik) is only used by YARN Services today, so the impact may be isolated. Having said that, based on the conversation here, it does appear there are other non-fatal cases that could trigger these errors, so I'm +1 on the proposal from [~jlowe] affecting both launch and relaunch.
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606089#comment-16606089 ]

Eric Yang commented on YARN-8751:
----------------------------------

+1 with [~jlowe]'s proposal that only INVALID_CONTAINER_EXEC_PERMISSIONS and INVALID_CONFIG_FILE throw ConfigurationException. The other exit codes are non-fatal, and launches hitting them need to be retried on a best-effort basis even when the system is running under unfavorable conditions.
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606007#comment-16606007 ]

Craig Condit commented on YARN-8751:
-------------------------------------

Each of these error codes could have any number of root causes, ranging from transient to task-specific, disk-specific, node-specific, or cluster-level. Trying to do root cause analysis of OS-level failures in code isn't really practical. No two environments are alike, and it's going to be very difficult to set a policy which makes sense for all clusters. This is where things like admin-provided health check scripts come into play. These can check things like disks available, disks non-full, permissions (at top-level dirs) set correctly, etc.

That said, I think we should have defaults which cause the least amount of pain in the majority of cases. It seems to me that in most cases, it's far more likely to be a transient or per-disk issue causing these failures than a global misconfiguration, so not failing the NM makes sense.

As a way to address detection of the specific issue mentioned in this JIRA, top-level permissions on NM-controlled dirs could be validated on startup (if they aren't already) and cause an NM failure at that point (or at least consider the specific disk bad). This would cause fail-fast behavior for something that is clearly configured wrong globally. It would also make these issues occurring at a container level far more likely to be transient or task/app-specific.
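To make the fail-fast idea concrete, here is a hypothetical startup-time check. The class, the expected 750 permission, and failing with an IOException are illustrative assumptions, not part of the attached patch:

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Hypothetical helper: validate top-level permissions on NM-controlled dirs
// at startup so a global misconfiguration fails fast instead of surfacing
// later as a per-container launch failure.
final class LocalDirPermissionCheck {
  private static final FsPermission EXPECTED = new FsPermission((short) 0750);

  static void validate(FileSystem fs, List<String> localDirs)
      throws IOException {
    for (String dir : localDirs) {
      FileStatus status = fs.getFileStatus(new Path(dir));
      FsPermission actual = status.getPermission();
      if (!actual.equals(EXPECTED)) {
        // Alternatively, mark only this disk bad rather than failing the NM.
        throw new IOException("Path " + dir + " has permission " + actual
            + " but needs permission " + EXPECTED);
      }
    }
  }
}
{code}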
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606003#comment-16606003 ]

Eric Yang commented on YARN-8751:
----------------------------------

[~shaneku...@gmail.com] I believe the COULD_NOT_CREATE_WORK_DIRECTORIES exit code should only occur after the operation has failed on all disks and the options are exhausted. The introduction of relaunch may single out one working directory and report a false positive, while the system may still have the option to fall back to creating a new working directory on another disk and move forward. I am not sure if the test system has more than one local disk. If it only had one disk, it may appear that this single container crashes the node manager. If relaunch doesn't retry other disks, then that is a bug, and the container-executor logic should be changed to detect such a case and create the working directory on another disk. This is similar to the fault tolerance design in HDFS: relaunch makes a best effort to reuse the same working directory, but uses another data directory if the current one has turned bad.

Let's look at the problem from a different angle: the container is doing destructive operations to its working directory and could knock out all disks by abusing relaunch. This looks more like a deliberate attempt to sabotage the system. In this case, it is really the system administrator's responsibility to disallow such a badly behaved user/image by not granting them privileged containers. This is the same as saying: don't hand them a chainsaw if you know they are irresponsible individuals. There is little that can be done to protect irresponsible individuals from themselves; you can only protect them by not giving them too much power. Disabling write mounts for privileged containers is the wrong option, because there are real programs that run multi-user containers and depend on the privileged container feature. If the badly behaved program is a QA test, then we may need to hand-wave: we handed you a chainsaw; read the instructions and be careful with it.
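A rough sketch of the fallback behavior described above. The method, {{goodLocalDirs}}, and the surrounding names are assumptions for illustration, not existing code:

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: on relaunch, prefer the existing workdir, but fall back to another
// healthy local dir before concluding COULD_NOT_CREATE_WORK_DIRECTORIES.
final class WorkDirFallback {
  static Path chooseWorkDir(FileSystem fs, List<String> goodLocalDirs,
      String containerWorkDirName) throws IOException {
    for (String localDir : goodLocalDirs) { // dirs the disk checker still trusts
      Path candidate = new Path(localDir, containerWorkDirName);
      try {
        if (fs.exists(candidate) || fs.mkdirs(candidate)) {
          return candidate; // reuse the old dir, or a freshly created one
        }
      } catch (IOException e) {
        // This disk is unusable for the container; try the next one.
      }
    }
    // Only now is the fatal exit code warranted: every disk was tried.
    throw new IOException("Could not create container work directory on any disk");
  }
}
{code}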
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605969#comment-16605969 ]

Eric Badger commented on YARN-8751:
------------------------------------

I agree that we shouldn't kill the NM because of something like bad permissions that only affects a single job. If that is possible, then a user could pretty easily bring down the entire cluster, which is double plus ungood. However, it would also be nice to still be able to mark the node bad in cases where things are really wrong and will affect all jobs.

Just thinking out loud here, but if all of the disks are 100% full, the NM is going to fail every container that runs on it. Yes, NM blacklisting will help, but that has to be re-learned for each application (afaik). It would be nice to detect if the error is actually fatal to all jobs or not. And I'm not sure that's an easy thing to do when it comes to creating directories. Maybe someone else has an idea?
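One illustrative (assumed, not proposed in any patch here) way to tell a node-wide condition apart from a per-container one would be a simple free-space sweep across the local dirs:

{code:java}
import java.io.File;
import java.util.List;

// Illustrative check: if every NM local dir is effectively full, a directory
// creation failure is fatal to all jobs on the node, not just this container.
// The threshold and class name are assumptions.
final class DiskFullCheck {
  static boolean allDisksFull(List<String> localDirs, long minFreeBytes) {
    for (String dir : localDirs) {
      if (new File(dir).getUsableSpace() >= minFreeBytes) {
        return false; // at least one disk can still host containers
      }
    }
    return true;
  }
}
{code}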
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605895#comment-16605895 ]

Craig Condit commented on YARN-8751:
-------------------------------------

{quote}[~jlowe]: So my vote is keep INVALID_CONTAINER_EXEC_PERMISSIONS and INVALID_CONFIG_FILE fatal but the others should only fail the single container launch rather than the whole NM process.
{quote}
Agreed. The remainder of the exit codes could be caused by any number of things, such as disk failure, which you point out. Even if the problem were to be caused by something more systemic, NM blacklisting should kick in pretty quickly as tasks fail. +1 on making this non-fatal.

Additionally, we may want to consider updating the diagnostic message returned in the following {{else}} clause to contain the exit code enum name as well as the number; this would seem to make diagnosing problems much easier for both users and administrators.
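For the diagnostics suggestion, something along these lines would work, assuming the {{ExitCode}} enum referenced in the description is a standard Java enum (the helper name is an assumption):

{code:java}
// Hypothetical helper: map a numeric container-executor exit code back to its
// ExitCode enum name for clearer diagnostics, e.g.
// "COULD_NOT_CREATE_WORK_DIRECTORIES (35)" instead of just "35".
static String describeExitCode(int exitCode) {
  for (ExitCode code : ExitCode.values()) {
    if (code.getExitCode() == exitCode) {
      return code.name() + " (" + exitCode + ")";
    }
  }
  return "UNKNOWN (" + exitCode + ")";
}
{code}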
[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy
[ https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605803#comment-16605803 ]

Jason Lowe commented on YARN-8751:
-----------------------------------

A bad container executor or config file is pretty catastrophic since the NM can't control anything at that point, including the inability to even clean up containers when it shuts down. However, the other errors are specific to setting up an individual container and should not bring down the NM. If a disk goes bad and the container executor can't create one of the directories then this should not be a fatal error to the NM, just a fatal error to that container launch. Otherwise a single disk failure can bring down the NM if the container executor discovers it before the NM disk checker does.

So my vote is keep INVALID_CONTAINER_EXEC_PERMISSIONS and INVALID_CONFIG_FILE fatal, but the others should only fail the single container launch rather than the whole NM process.