Andrew Rowson created SPARK-5655:
------------------------------------

             Summary: YARN Auxiliary Shuffle service can't access shuffle files 
on Hadoop cluster configured in secure mode
                 Key: SPARK-5655
                 URL: https://issues.apache.org/jira/browse/SPARK-5655
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.2.0
         Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2
            Reporter: Andrew Rowson


When running a Spark job with the YARN auxiliary shuffle service enabled, on a 
cluster that doesn't run containers as the same user as the nodemanager, jobs 
fail with something similar to:

java.io.FileNotFoundException: 
/data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index
 (Permission denied)

The root cause of this is here: 
https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287

Spark will attempt to chmod 700 any application directories it creates during 
the job, which includes files created in the nodemanager's usercache directory. 
The owner of these files is the user the container runs as: on a secure cluster 
this is the user who submitted the job, and on a nonsecure cluster with 
yarn.nodemanager.container-executor.class configured it is the value of 
yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.
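
For reference, the permission tightening in question is essentially the 
following (paraphrasing the linked Utils.scala; the JDK has no chmod call, so 
it's built from java.io.File's owner-only setters):

import java.io.File

// Roughly equivalent to `chmod 700 file`: clear each permission bit for
// everyone, then re-grant it to the owner only. Note this also strips the
// group permission bits that YARN's setgid appcache directory propagated.
def chmod700(file: File): Boolean = {
  file.setReadable(false, false) &&
  file.setReadable(true, true) &&
  file.setWritable(false, false) &&
  file.setWritable(true, true) &&
  file.setExecutable(false, false) &&
  file.setExecutable(true, true)
}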

The problem with this is that the auxiliary shuffle service runs as part of the 
nodemanager, which typically runs as the user 'yarn' and therefore can't access 
files that are readable only by their owner.

YARN already attempts to secure files created under appcache while keeping them 
readable by the nodemanager: it sets the group of the appcache directory to 
'yarn' and also sets the setgid flag, so files and directories created under it 
also get the 'yarn' group. Normally this means the nodemanager can read these 
files, but Spark's chmod 700 wipes those group permissions out.
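
A quick way to verify that inheritance (the path below is illustrative) is to 
read the POSIX attributes of a directory under appcache; each level should 
report group 'yarn', and the perms field shows whether the group bits survived:

import java.nio.file.{Files, Paths}
import java.nio.file.attribute.PosixFileAttributeView

// Print owner, group and permission bits for a directory under appcache.
val p = Paths.get("/data/1/yarn/nm/usercache/username/appcache")
val attrs = Files.getFileAttributeView(p, classOf[PosixFileAttributeView]).readAttributes()
println(s"owner=${attrs.owner.getName}, group=${attrs.group.getName}, perms=${attrs.permissions}")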

I'm not sure what the right approach is here. Commenting out the chmod 700 
functionality makes this work on YARN, and still leaves the application files 
readable only by the owner and the group:

data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
 # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
-rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data

But this may not be the right approach off YARN. Perhaps a check to determine 
whether the chmod 700 step is necessary (i.e. whether we're running outside a 
YARN container) is required; see the sketch below. Sadly, I don't have a 
non-YARN environment to test with, otherwise I'd be able to suggest a patch.
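
As a sketch of what such a guard might look like (an assumption on my part, not 
existing Spark code): YARN sets the CONTAINER_ID environment variable in every 
container it launches, so its presence seems a plausible proxy for "the 
nodemanager manages this directory tree's permissions":

import java.io.File

object LocalDirPerms {
  // YARN's container launch environment always defines CONTAINER_ID.
  def isRunningInYarnContainer: Boolean = sys.env.contains("CONTAINER_ID")

  // Tighten to owner-only (chmod 700) except under YARN, where the
  // inherited group-'yarn' bits must survive so the auxiliary shuffle
  // service (running as the nodemanager user) can read the files.
  def secureLocalDir(dir: File): Unit = {
    if (!isRunningInYarnContainer) {
      dir.setReadable(false, false); dir.setReadable(true, true)
      dir.setWritable(false, false); dir.setWritable(true, true)
      dir.setExecutable(false, false); dir.setExecutable(true, true)
    }
  }
}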

I believe there is a related issue in the MapReduce framework: 
https://issues.apache.org/jira/browse/MAPREDUCE-3728


