[ https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or updated SPARK-5655:
-----------------------------
    Affects Version/s: 1.3.0

> YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster
> configured in secure mode
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-5655
>                 URL: https://issues.apache.org/jira/browse/SPARK-5655
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.0, 1.2.1
>         Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2
>            Reporter: Andrew Rowson
>            Priority: Critical
>              Labels: hadoop
>
> When running a Spark job on a YARN cluster that doesn't run containers under
> the same user as the nodemanager, and with the YARN auxiliary shuffle service
> enabled, jobs fail with something similar to:
> {code:java}
> java.io.FileNotFoundException: /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index (Permission denied)
> {code}
> The root cause is here:
> https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287
> Spark attempts to chmod 700 any application directories it creates during the
> job, which includes files created under the nodemanager's usercache directory.
> The owner of these files is the container UID, which on a secure cluster is
> the user submitting the job, and on a nonsecure cluster with
> yarn.nodemanager.container-executor.class configured is the value of
> yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.
> The problem is that the auxiliary shuffle manager runs as part of the
> nodemanager, which typically runs as the user 'yarn', and therefore cannot
> access files that are only owner-readable.
> YARN already tries to secure files created under appcache while keeping them
> readable by the nodemanager: it sets the group of the appcache directory to
> 'yarn' and sets the setgid flag, so files and directories created underneath
> also get the 'yarn' group. Normally the nodemanager can then read these files,
> but Spark's chmod 700 wipes this out.
> I'm not sure what the right approach is here. Commenting out the chmod 700
> functionality makes this work on YARN, and still leaves the application files
> readable only by the owner and the group:
> {code}
> /data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c
> # ls -lah
> total 206M
> drwxr-s---  2 nobody yarn 4.0K Feb  6 18:30 .
> drwxr-s--- 12 nobody yarn 4.0K Feb  6 18:30 ..
> -rw-r-----  1 nobody yarn 206M Feb  6 18:30 shuffle_0_0_0.data
> {code}
> But this may not be the right approach on non-YARN deployments. Perhaps an
> additional check for whether the chmod 700 step is necessary (i.e. non-YARN)
> is required (a rough sketch follows after this message). Sadly, I don't have a
> non-YARN environment to test with, otherwise I'd be able to suggest a patch.
> I believe this is a related issue in the MapReduce framework:
> https://issues.apache.org/jira/browse/MAPREDUCE-3728
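To make the suggested workaround concrete, here is a minimal Scala sketch (not the actual Spark code or patch) of what a conditional permission step could look like. The chmod700 helper mirrors the JDK-level java.io.File permission calls; the runningOnYarn check and the createAppDir method are hypothetical names introduced purely for illustration, and how Spark would actually detect YARN mode is an assumption.

{code}
import java.io.{File, IOException}

// Hypothetical sketch, not the Spark implementation: only tighten permissions
// to owner-only when we are NOT on YARN, so that YARN's setgid 'yarn' group on
// the appcache directory keeps shuffle files readable by the shuffle service.
object ShufflePermissionsSketch {

  /** JDK equivalent of `chmod 700 dir`: owner-only read/write/execute. */
  def chmod700(file: File): Boolean = {
    file.setReadable(false, false) &&
    file.setReadable(true, true) &&
    file.setWritable(false, false) &&
    file.setWritable(true, true) &&
    file.setExecutable(false, false) &&
    file.setExecutable(true, true)
  }

  /** Assumed signal for YARN mode; the real detection mechanism may differ. */
  def runningOnYarn: Boolean = sys.env.contains("SPARK_YARN_MODE")

  /** Create a per-application directory, restricting permissions only off-YARN. */
  def createAppDir(root: File, name: String): File = {
    val dir = new File(root, name)
    if (!dir.exists() && !dir.mkdirs()) {
      throw new IOException(s"Failed to create directory $dir")
    }
    // On YARN the parent appcache dir is group 'yarn' with setgid, so leaving
    // the group bits untouched preserves nodemanager (shuffle service) access.
    if (!runningOnYarn && !chmod700(dir)) {
      throw new IOException(s"Failed to restrict permissions on $dir")
    }
    dir
  }
}
{code}

Whether such a guard is acceptable on non-YARN deployments (where chmod 700 is presumably still wanted) is exactly the open question raised above.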