[ 
https://issues.apache.org/jira/browse/YARN-9647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16890503#comment-16890503
 ] 

Jim Brennan commented on YARN-9647:
-----------------------------------

[~ebadger], [~eyang], [~magnum] I think I'm following the discussion and I 
agree with the problem analysis.
{quote}It's slightly more nuanced than this. If the lists don't match the 
container still could've failed because of an invalid mount. Basically if we 
get an invalid mount error then we need to figure out whether that invalid 
mount was in the original allowed-mounts lists in container-executor.cfg. If it 
was, then the error message should indicate a bad disk. Otherwise, the usual 
invalid mount error message should be fine.
{quote}
Do we need to maintain two lists? check_mount_permitted() already returns -1 
when normalize_mount() fails for mount_src, before it even checks whether the 
mount is permitted. If the disk is bad, I think this is where it will fail, so 
we would never reach the permission check. 
Maybe we just need to change this error message:
{noformat}
fprintf(ERRORFILE, "Invalid docker mount '%s', realpath=%s\n", values[i], 
mount_src);
{noformat}
to
{noformat}
fprintf(ERRORFILE, "Invalid source path '%s' for docker mount '%s', maybe bad disk?\n",
        mount_src, values[i]);
{noformat}
Even better, pull the normalizing of mount_src out of check_mount_permitted and 
do it separately.
{noformat}
  char *normalized_path = normalize_mount(mount_src, 0);
  if (normalized_path == NULL) {
    fprintf(ERRORFILE, "Invalid source path '%s' for docker mount '%s', maybe bad disk?\n",
            mount_src, values[i]);
    ret = INVALID_DOCKER_MOUNT;
    goto free_and_exit;
  }
  permitted_rw = check_mount_permitted((const char **) permitted_rw_mounts,
                                       normalized_path);
  permitted_ro = check_mount_permitted((const char **) permitted_ro_mounts,
                                       normalized_path);
{noformat}
For paths coming from the NM (local dirs / log dirs), the NM should already 
have checked that bad ones aren't in the list.
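
To make the idea of separating the two failure modes concrete, here is a minimal standalone sketch. It is not the docker-util.c code: classify_mount_src(), is_under(), and the enum values are hypothetical names made up for illustration, realpath(3) stands in for normalize_mount(), and the permitted entries are assumed to be already normalized. The point is only that "source path cannot be resolved" (possibly a bad disk) and "source path is not permitted" are reported as distinct outcomes.

```c
#define _XOPEN_SOURCE 700  /* for realpath() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not constants from docker-util.c. */
enum mount_check {
  MOUNT_PERMITTED = 0,
  MOUNT_NOT_PERMITTED = 1,
  MOUNT_SRC_UNRESOLVABLE = 2   /* source could not be resolved, maybe a bad disk */
};

/* Returns 1 if path equals prefix or lies beneath it, 0 otherwise. */
static int is_under(const char *path, const char *prefix) {
  size_t n = strlen(prefix);
  if (strncmp(path, prefix, n) != 0) {
    return 0;
  }
  if (n > 0 && prefix[n - 1] == '/') {
    return 1;                  /* prefix is "/" or already ends with a slash */
  }
  return path[n] == '\0' || path[n] == '/';
}

/*
 * Step 1: resolve the source path on its own.
 * Step 2: only if that succeeds, compare it against the permitted list.
 * 'permitted' is a NULL-terminated array of already-normalized paths
 * (an assumption of this sketch).
 */
enum mount_check classify_mount_src(const char *src, const char **permitted) {
  assert(src != NULL);
  char *real = realpath(src, NULL);   /* malloc'd by realpath, freed below */
  if (real == NULL) {
    return MOUNT_SRC_UNRESOLVABLE;
  }
  enum mount_check result = MOUNT_NOT_PERMITTED;
  for (int i = 0; permitted != NULL && permitted[i] != NULL; i++) {
    if (is_under(real, permitted[i])) {
      result = MOUNT_PERMITTED;
      break;
    }
  }
  free(real);
  return result;
}
```

A caller could then print the "maybe bad disk?" message only for MOUNT_SRC_UNRESOLVABLE and keep the usual invalid mount message for MOUNT_NOT_PERMITTED.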

> Docker launch fails when local-dirs or log-dirs is unhealthy.
> -------------------------------------------------------------
>
>                 Key: YARN-9647
>                 URL: https://issues.apache.org/jira/browse/YARN-9647
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.1.2
>            Reporter: KWON BYUNGCHANG
>            Priority: Major
>         Attachments: YARN-9647.001.patch, YARN-9647.002.patch
>
>
> My /etc/hadoop/conf/container-executor.cfg:
> {code}
> [docker]
>    docker.allowed.ro-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
>    docker.allowed.rw-mounts=/data1/hadoop/yarn/local,/data2/hadoop/yarn/local
> {code}
> If /data2 is unhealthy, the docker launch fails even though the container 
> can still use /data1 as local-dir and log-dir. 
> The error message is below:
> {code}
> [2019-06-25 14:55:26.168]Exception from container-launch. Container id: 
> container_e50_1561100493387_5185_01_000597 Exit code: 29 Exception message: 
> Launch container failed Shell error output: Could not determine real path of 
> mount '/data2/hadoop/yarn/local' Could not determine real path of mount 
> '/data2/hadoop/yarn/local' Unable to find permitted docker mounts on disk 
> Error constructing docker command, docker error code=16, error message='Mount 
> access error' Shell output: main : command provided 4 main : run as user is 
> magnum main : requested yarn user is magnum Creating script paths... Creating 
> local dirs... [2019-06-25 14:55:26.189]Container exited with a non-zero exit 
> code 29. [2019-06-25 14:55:26.192]Container exited with a non-zero exit code 
> 29. 
> {code}
> The root cause is that normalize_mounts() in docker-util.c returns -1 because 
> it cannot resolve the real path of /data2/hadoop/yarn/local (note that /data2 
> has a disk fault at this point).
> However, the disks backing the NM local dirs and log dirs can fail at any time.
> Docker launch should succeed as long as some local dirs and log dirs are available.


