[jira] [Resolved] (YARN-10477) runc launch failure should not cause nodemanager to go unhealthy

Jim Brennan (Jira) Wed, 28 Oct 2020 13:32:19 -0700


     [ 
https://issues.apache.org/jira/browse/YARN-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jim Brennan resolved YARN-10477.
--------------------------------
    Resolution: Invalid

Closing this as invalid.  The problem was only there in our internal version of 
container-executor.  I should have checked the code in trunk before filing.


> runc launch failure should not cause nodemanager to go unhealthy
> ----------------------------------------------------------------
>
>                 Key: YARN-10477
>                 URL: https://issues.apache.org/jira/browse/YARN-10477
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.3.1, 3.4.1
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>
> We have observed some failures when launching containers with runc.  We have 
> not yet identified the root cause of those failures, but a side effect of 
> these failures was that the NodeManager marked itself unhealthy.  Since these 
> are rare failures that only affect a single launch, they should not cause the 
> NodeManager to be marked unhealthy.
> Here is an example RM log:
> {noformat}
> resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event 
> dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with 
> details: Linux Container Executor reached unrecoverable exception
> {noformat}
> And here is an example of the NM log:
> {noformat}
> 2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO 
> runtime.RuncContainerRuntime: Launch container failed for 
> container_e25_1601602719874_10691_01_001723
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=24: OCI command has bad/missing local directories
> {noformat}
> The problem is that the runc code in container-executor is re-using exit code 
> 24 (INVALID_CONFIG_FILE), which is intended for problems with the 
> container-executor.cfg file, and those failures are fatal for the NM.  We 
> should use a different exit code for runc launch failures.
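
For illustration only, here is a minimal sketch of the exit-code distinction
the description argues for.  It is not the NodeManager or container-executor
code.  Only the value 24 (INVALID_CONFIG_FILE) comes from the report; the
RUNC_BAD_LOCAL_DIRS name, its value 99, and the classify_launch_failure helper
are hypothetical names made up for this example.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    # Value taken from the issue description.
    INVALID_CONFIG_FILE = 24
    # Hypothetical code for this sketch only; not a real container-executor value.
    RUNC_BAD_LOCAL_DIRS = 99


# Exit codes that indicate a broken node-wide configuration.
FATAL_FOR_NODE = frozenset({ExitCode.INVALID_CONFIG_FILE})


def classify_launch_failure(exit_code: int) -> str:
    """Decide whether a failed launch should affect the whole node."""
    if exit_code in FATAL_FOR_NODE:
        # A bad container-executor.cfg affects every launch on the node.
        return "NODE_UNHEALTHY"
    # Anything else is treated as a failure of this one container only.
    return "CONTAINER_FAILED"
```

Under these assumptions, classify_launch_failure(24) yields "NODE_UNHEALTHY",
while a launch failure reported with its own code (99 here) yields
"CONTAINER_FAILED" and leaves the node healthy.  Re-using 24 for a per-launch
error is what makes the two cases indistinguishable.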



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
