[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-07 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607817#comment-16607817
 ] 

Eric Yang commented on YARN-8751:
-

[~ccondit-target] cherry-picked to branch-3.1.

> Container-executor permission check errors cause the NM to be marked unhealthy
> --
>
> Key: YARN-8751
> URL: https://issues.apache.org/jira/browse/YARN-8751
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Shane Kumpf
>Assignee: Craig Condit
>Priority: Critical
>  Labels: Docker
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8751.001.patch
>
>
> {{ContainerLaunch}} (and {{ContainerRelaunch}}) contains logic to mark a 
> NodeManager as UNHEALTHY if a {{ConfigurationException}} is thrown by 
> {{ContainerLaunch#launchContainer}} (or relaunchContainer). The exception 
> occurs based on the exit code returned by container-executor, and 7 different 
> exit codes cause the NM to be marked UNHEALTHY.
> {code:java}
> if (exitCode ==
>         ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() ||
>     exitCode ==
>         ExitCode.INVALID_CONFIG_FILE.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_SCRIPT_COPY.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_CREDENTIALS_FILE.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_WORK_DIRECTORIES.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_APP_LOG_DIRECTORIES.getExitCode() ||
>     exitCode ==
>         ExitCode.COULD_NOT_CREATE_TMP_DIRECTORIES.getExitCode()) {
>   throw new ConfigurationException(
>       "Linux Container Executor reached unrecoverable exception", e);{code}
> I can understand why these are treated as fatal with the existing process
> container model. However, with privileged Docker containers this may be too
> harsh, as privileged Docker containers don't guarantee the user's identity
> will be propagated into the container, so these mismatches can occur. Outside
> of privileged containers, an application may inadvertently change the
> permissions on one of these directories, triggering this condition.
> In our case, a container changed the "appcache/<app_id>/<container_id>/"
> directory permissions to 774. Some time later, the process in the container
> died and the Retry Policy kicked in to RELAUNCH the container. When the
> RELAUNCH occurred, container-executor checked the permissions of the
> "appcache/<app_id>/<container_id>/" directory (the existing workdir is
> retained for RELAUNCH) and returned exit code 35. Exit code 35 is
> COULD_NOT_CREATE_WORK_DIRECTORIES, which is a fatal error. This killed all
> containers running on that node, when really only this one container should
> have been impacted.
> {code:java}
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Exception from container-launch.
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Container id: 
> container_e15_1535130383425_0085_01_05
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Exit code: 35
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Exception message: Relaunch 
> container failed
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Shell error output: Could not 
> create container dirsCould not create local files and directories 5 6
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) -
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Shell output: main : command 
> provided 4
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - main : run as user is user
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - main : requested yarn user is yarn
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Creating script paths...
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Creating local dirs...
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Path 
> /grid/0/hadoop/yarn/local/usercache/user/appcache/application_1535130383425_0085/container_e15_1535130383425_0085_01_05
>  has permission 774 but needs permission 750.
> 2018-08-31 21:07:22,365 INFO  nodemanager.ContainerExecutor 
> (ContainerExecutor.java:logOutput(541)) - Wrote the exit code 35 to (null)
> 2018-08-31 

[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-07 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607808#comment-16607808
 ] 

Hudson commented on YARN-8751:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14905 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14905/])
YARN-8751. Reduce conditions that mark node manager as unhealthy.
(eyang: rev 7d623343879ce9a8f8e64601024d018efc02794c)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
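
For reference, a minimal sketch of what the reduced check in
LinuxContainerExecutor could look like after this change, assuming the commit
follows [~jlowe]'s proposal of keeping only the two configuration-related exit
codes fatal (the actual committed diff may differ in detail):

{code:java}
// Sketch only: misconfiguration of the executor binary or its config file
// remains fatal to the whole NodeManager.
if (exitCode ==
    ExitCode.INVALID_CONTAINER_EXEC_PERMISSIONS.getExitCode() ||
    exitCode ==
    ExitCode.INVALID_CONFIG_FILE.getExitCode()) {
  throw new ConfigurationException(
      "Linux Container Executor reached unrecoverable exception", e);
}
// All other exit codes (COULD_NOT_CREATE_WORK_DIRECTORIES, etc.) fall
// through to the existing per-container error handling, so only the
// affected container launch fails.
{code}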



[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-07 Thread Craig Condit (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607797#comment-16607797
 ] 

Craig Condit commented on YARN-8751:


[~eyang], [~shaneku...@gmail.com]: Do we want to commit this to branch-3.1 as 
well?


[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-07 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607373#comment-16607373
 ] 

Eric Yang commented on YARN-8751:
-

+1 LGTM.


[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-06 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606524#comment-16606524
 ] 

Hadoop QA commented on YARN-8751:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 27m 
15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
 1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 32s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m  0s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 18m 
54s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 98m 21s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 |
| JIRA Issue | YARN-8751 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12938719/YARN-8751.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux f1b1de2e20ea 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 
07:31:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / eca1a4b |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/21780/testReport/ |
| Max. process+thread count | 407 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/21780/console |
| Powered by | Apache Yetus 0.

[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-06 Thread Craig Condit (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606270#comment-16606270
 ] 

Craig Condit commented on YARN-8751:


[~shaneku...@gmail.com], it looks like we have consensus on the approach. I can 
take this one.


[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-06 Thread Shane Kumpf (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606189#comment-16606189
 ] 

Shane Kumpf commented on YARN-8751:
---

Thanks for the feedback and suggestions, everyone. I think the issue is most 
likely to happen under relaunch conditions with a poorly behaving container (as 
noted by [~eyang]). Relaunch (afaik) is only used by YARN Services today, so 
the impact may be isolated. Having said that, based on the conversation here, 
it does appear there are other non-fatal cases that could trigger these errors, 
so I'm +1 on the proposal from [~jlowe], affecting both launch and relaunch.


[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-06 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606089#comment-16606089
 ] 

Eric Yang commented on YARN-8751:
-

+1 on [~jlowe]'s proposal that only INVALID_CONTAINER_EXEC_PERMISSIONS and 
INVALID_CONFIG_FILE throw ConfigurationException. The other exit codes are 
non-fatal and should be retried on a best-effort basis even when the system is 
running under unfavorable conditions.


[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-06 Thread Craig Condit (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606007#comment-16606007
 ] 

Craig Condit commented on YARN-8751:


Each of these error codes could have any number of root causes ranging from 
transient to task-specific, disk-specific, node-specific, or cluster-level. 
Trying to do root cause analysis of OS-level failures in code isn't really 
practical. No two environments are alike and it's going to be very difficult to 
set a policy which makes sense for all clusters. This is where things like 
admin-provided health check scripts come into play. These can check things like 
disks available, disks non-full, permissions (at top level dirs) set correctly, 
etc. That said, I think we should have defaults which cause the least amount of 
pain in the majority of cases. It seems to me that in most cases, it's far more 
likely to be a transient or per-disk issue causing these failures than a global 
misconfiguration, so not failing the NM makes sense.

As a way to address detection of the specific issue mentioned in this JIRA, 
top-level permissions on NM-controlled dirs could be validated on startup (if 
they aren't already) and cause an NM failure at that point (or at least 
consider the specific disk bad). This would cause fail-fast behavior for 
something that is clearly misconfigured globally. It would also make these 
issues, when they occur at the container level, far more likely to be transient 
or task/app-specific.
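
A rough illustration of that startup validation, as a self-contained sketch 
(hypothetical helper names, not existing NodeManager code); it reports any 
top-level local dir whose POSIX permissions don't match the expected mode:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LocalDirPermissionCheck {

  // Hypothetical fail-fast check: run at NM startup (or per disk) and either
  // refuse to start or mark the disk bad when a top-level dir is wrong.
  static List<String> findBadDirs(List<Path> localDirs,
      Set<PosixFilePermission> expected) throws IOException {
    List<String> bad = new ArrayList<>();
    for (Path dir : localDirs) {
      Set<PosixFilePermission> actual = Files.getPosixFilePermissions(dir);
      if (!actual.equals(expected)) {
        bad.add(dir + " has " + PosixFilePermissions.toString(actual)
            + " but expected " + PosixFilePermissions.toString(expected));
      }
    }
    return bad;
  }
}
{code}

For a container work dir the expected mode would be 
PosixFilePermissions.fromString("rwxr-x---") (750), matching the check that 
container-executor performs in the log quoted above.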

 


[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-06 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606003#comment-16606003
 ] 

Eric Yang commented on YARN-8751:
-

[~shaneku...@gmail.com] I believe the COULD_NOT_CREATE_WORK_DIRECTORIES exit 
code should only be raised once all disks have been exhausted.  The 
introduction of relaunch may single out a single working directory and report a 
false positive, while the system may still have the option to fall back and 
create a new working directory on another disk to move forward.  I am not sure 
whether the test system has more than one local disk.  If it only had one disk, 
it may appear that this single container crashed the node manager.  If relaunch 
doesn't retry other disks, then that is a bug; the container-executor logic 
should be changed to detect such a case and create the working directory on 
another disk.  This is similar to the fault tolerance design in HDFS: relaunch 
makes a best effort to reuse the same working directory, but uses another data 
directory if the current one has turned bad.
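
A simplified sketch of that fallback behavior (illustrative Java only; the real 
container-executor is a native binary, and the helper below is hypothetical): 
try each configured local dir in turn and create the container work dir on the 
first disk that works.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class WorkDirFallback {

  // Hypothetical fallback: mirror the HDFS-style behavior described above by
  // trying each local dir in turn instead of failing on the first bad disk.
  static Path createWorkDir(List<Path> localDirs, String containerSubPath)
      throws IOException {
    IOException last = null;
    for (Path root : localDirs) {
      try {
        Path workDir = root.resolve(containerSubPath);
        Files.createDirectories(workDir);
        return workDir;                 // this disk worked, use it
      } catch (IOException e) {
        last = e;                       // disk bad or misconfigured, try next
      }
    }
    throw new IOException("Could not create work dir on any local dir", last);
  }
}
{code}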

Let's look at the problem from a different angle: the container is performing a 
destructive operation on its working directory and knocking out all disks by 
abusing relaunch.  This looks more like a deliberate attempt to sabotage the 
system.  In this case, it is really the system administrator's responsibility 
not to grant privileged containers to such a badly behaved user/image.  This is 
the same as saying: don't hand someone a chainsaw if you know they are 
irresponsible.  There is little that can be done to protect irresponsible 
individuals from themselves; you can only protect them by not giving them too 
much power.  Disabling write mounts for privileged containers is the wrong 
option, because there are real programs that run multi-user containers and 
depend on the privileged container feature.  If the badly behaved program is a 
QA test, then we may just have to say: we handed you a chainsaw, read the 
instructions and be careful with it.


[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-06 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605969#comment-16605969
 ] 

Eric Badger commented on YARN-8751:
---

I agree that we shouldn't kill the NM because of something like bad permissions 
that only affects a single job. If that is possible, then a user could pretty 
easily bring down the entire cluster, which is double plus ungood. However, it 
would also be nice to still be able to mark the node bad in cases where things 
are really wrong and will affect all jobs. Just thinking out loud here, but if 
all of the disks are 100% full, the NM is going to fail every container that 
runs on it. Yes, NM blacklisting will help, but that has to be re-learned for 
each application (afaik). It would be nice to detect if the error is actually 
fatal to all jobs or not. And I'm not sure that's an easy thing to do when it 
comes to creating directories. Maybe someone else has an idea? 
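
One way to approximate that distinction, as a purely hypothetical sketch (not 
existing NM code): treat a directory-creation failure as node-level only if a 
small write probe fails on every configured local dir, and as a per-container 
problem otherwise.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class NodeLevelFailureCheck {

  // Hypothetical heuristic: the error is "fatal to all jobs" only when every
  // local dir fails a write probe (e.g. all disks full or read-only).
  static boolean isNodeLevelFailure(List<Path> localDirs) {
    for (Path dir : localDirs) {
      try {
        Path probe = Files.createTempFile(dir, ".nm-probe", null);
        Files.delete(probe);
        return false;   // at least one dir is still writable
      } catch (IOException ignored) {
        // this dir is unusable, keep probing the others
      }
    }
    return true;        // every local dir failed the probe
  }
}
{code}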


[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-06 Thread Craig Condit (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605895#comment-16605895
 ] 

Craig Condit commented on YARN-8751:


{quote}[~jlowe] : So my vote is keep INVALID_CONTAINER_EXEC_PERMISSIONS and 
INVALID_CONFIG_FILE fatal but the others should only fail the single container 
launch rather than the whole NM process.
{quote}
Agreed. The remainder of the exit codes could be caused by any number of 
things, such as disk failure, which you point out. Even if the problem were to 
be caused by something more systemic, NM blacklisting should kick in pretty 
quickly as tasks fail. +1 on making this non-fatal. Additionally, we may want 
to consider updating the diagnostic message returned in the following {{else}} 
clause to contain the exit code enum name as well as the number – this would 
seem to make diagnosing problems much easier for both users and administrators.
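
As a sketch of that diagnostic improvement (assuming ExitCode is the enum used 
in the snippet above; this is not the committed code), the numeric exit code 
could be resolved back to its enum name before building the message:

{code:java}
// Hypothetical helper: turns 35 into "35 (COULD_NOT_CREATE_WORK_DIRECTORIES)"
// so diagnostics are readable without consulting the exit-code table.
static String describeExitCode(int exitCode) {
  for (ExitCode c : ExitCode.values()) {
    if (c.getExitCode() == exitCode) {
      return exitCode + " (" + c.name() + ")";
    }
  }
  return String.valueOf(exitCode);
}
{code}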


[jira] [Commented] (YARN-8751) Container-executor permission check errors cause the NM to be marked unhealthy

2018-09-06 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605803#comment-16605803
 ] 

Jason Lowe commented on YARN-8751:
--

A bad container executor or config file is pretty catastrophic since the NM 
can't control anything at that point, including the inability to even clean up 
containers when it shuts down.  However, the other errors are specific to 
setting up an individual container and should not bring down the NM.  If a disk 
goes bad and the container executor can't create one of the directories then 
this should not be a fatal error to the NM, just a fatal error to that 
container launch.  Otherwise a single disk failure can bring down the NM if the 
container executor discovers it before the NM disk checker does.

So my vote is keep INVALID_CONTAINER_EXEC_PERMISSIONS and INVALID_CONFIG_FILE 
fatal but the others should only fail the single container launch rather than 
the whole NM process.
