[ 
https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890435#comment-16890435
 ] 

Eric Yang commented on YARN-9647:
---------------------------------

[~ebadger], I think the approach taken is ok.  We want to filter out bad disk 
from allowed mount to guard against user defined mount point or system 
suggested mount point.  The difficult part is to identify if the mount path is 
user specified or system suggested.  In .cmd file, both user specified and 
system suggested paths are listed together.  There is no easy way to rotate to 
a different disk, unless node manager relaunch the container with another set 
of workdir paths.

[~magnum] , I think [~ebadger] is also right that this patch may have 
misleading error message when bad disk happens.  We can resolve this error by 
keeping track of the original container-executor.cfg, and normalized list.  
When two lists are not matching, container-executor can provide a different 
error message that container failed to launch due to unhealthy disk rather than 
continuing.

Would this work?

> Docker launch fails when local-dirs or log-dirs is unhealthy.
> -------------------------------------------------------------
>
>                 Key: YARN-9647
>                 URL: https://issues.apache.org/jira/browse/YARN-9647
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.1.2
>            Reporter: KWON BYUNGCHANG
>            Priority: Major
>         Attachments: YARN-9647.001.patch, YARN-9647.002.patch
>
>
> my /etc/hadoop/conf/container-executor.cfg
> {code}
> [docker]
>    docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
>    docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
> {code}
> if /data2 is unhealthy, docker launch fails  although container can use 
> /data1 as local-dir, log-dir 
> error message is below
> {code}
> [2019-06-25 14:55:26.168]Exception from container-launch. Container id: 
> container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: 
> Launch container failed Shell error output: Could not determine real path of 
> mount '/data2/hadoop/yarn/local' Could not determine real path of mount 
> '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk 
> Error constructing docker command, docker error code=16, error message='Mount 
> access error' Shell output: main : command provided 4 main : run as user is 
> magnum main : requested yarn user is magnum Creating script paths... Creating 
> local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit 
> code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code 
> 29. 
> {code}
> root cause is that normalize_mounts() in docker-util.c return -1  because it 
> cannot resolve real path of /data2/hadoop/yarn/local.(note that /data2 is 
> disk fault  at this point)
> however disk of nm local dirs and nm log dirs can fail at any time.
> docker launch should succeed if there are available local dirs and log dirs.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to