[ https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219498#comment-13219498 ]
Hudson commented on MAPREDUCE-3728:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk-Commit #1886 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1886/])
MAPREDUCE-3728. ShuffleHandler can't access results when configured in a secure mode (ahmed via tucu) (Revision 1295245)

Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java

> ShuffleHandler can't access results when configured in a secure mode
> --------------------------------------------------------------------
>
> Key: MAPREDUCE-3728
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3728
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2, nodemanager
> Affects Versions: 0.23.0
> Reporter: Roman Shaposhnik
> Assignee: Ahmed Radwan
> Priority: Critical
> Fix For: 0.23.3
>
> Attachments: MAPREDUCE-3728.patch
>
>
> While running the simplest of jobs (Pi) on MR2 in a fully secure configuration, I noticed that the job was failing on the reduce side with the following messages littering the nodemanager logs:
> {noformat}
> 2012-01-19 08:35:32,544 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/rvs/appcache/application_1326928483038_0001/output/attempt_1326928483038_0001_m_000003_0/file.out.index in any of the configured local directories
> {noformat}
> Digging further, I found that the permissions on the files/dirs prohibited the nodemanager (running as the user yarn) from accessing these files:
> {noformat}
> $ ls -l /data/3/yarn/usercache/testuser/appcache/application_1327102703969_0001/output/attempt_1327102703969_0001_m_000001_0
> -rw-r----- 1 testuser testuser 28 Jan 20 15:41 file.out
> -rw-r----- 1 testuser testuser 32 Jan 20 15:41 file.out.index
> {noformat}
> Digging even further revealed that the group-sticky bit that was faithfully put on all the subdirectories between testuser and application_1327102703969_0001 was gone from output and attempt_1327102703969_0001_m_000001_0.
> Looking into how these subdirectories are created (org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.initDirs()):
> {noformat}
> // $x/usercache/$user/appcache/$appId/filecache
> Path appFileCacheDir = new Path(appBase, FILECACHE);
> appsFileCacheDirs[i] = appFileCacheDir.toString();
> lfs.mkdir(appFileCacheDir, null, false);
> // $x/usercache/$user/appcache/$appId/output
> lfs.mkdir(new Path(appBase, OUTPUTDIR), null, false);
> {noformat}
> reveals that lfs.mkdir ends up manipulating permissions and thus clears the sticky bit from output and filecache.
> At this point I'm at a loss about how this is supposed to work. My understanding was that the whole sequence of events here was predicated on the sticky bit being set, so that daemons running as the user yarn (default group yarn) have access to the resulting files and subdirectories down at output and below.
> Please let me know if I'm missing something or whether this is just a bug that needs to be fixed.
> On a related note, when the shuffle side of the Pi job failed, the job itself didn't. It went into an endless loop and only exited when it had exhausted all the local storage for the log files (at which point the nodemanager died and thus the job ended). Perhaps this is an even more serious side effect of this issue that needs to be investigated separately.
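The "group-sticky bit" discussed in the quoted report is the setgid bit on the directories, and the mechanism the reporter describes boils down to this: a plain mkdir under a setgid parent inherits the bit (and the parent's group), while a follow-up chmod that only carries the nine rwx bits drops it, which, per the description, is effectively what lfs.mkdir ends up doing even though the snippet passes a null permission. Below is a minimal, self-contained sketch of just that mechanism; it is not the YARN/ContainerLocalizer code path, the /tmp paths, modes and class name are made up for the demo, and it assumes a Linux local filesystem:

{noformat}
// Standalone illustration (not the YARN code path) of how an explicit chmod
// drops an inherited setgid bit. Paths and modes are made up for the demo;
// run as a regular user on Linux.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;

public class SetgidMkdirDemo {

  // Small helper: run a shell command, echo its output, and wait for it.
  private static void run(String... cmd) throws IOException, InterruptedException {
    new ProcessBuilder(cmd).inheritIO().start().waitFor();
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for $x/usercache/$user/appcache/$appId with the group-sticky
    // (setgid) bit set, the way the upstream directories are described above.
    run("rm", "-rf", "/tmp/setgid-demo");
    Path appBase = Paths.get("/tmp/setgid-demo/appcache/app_0001");
    run("mkdir", "-p", appBase.toString());
    run("chmod", "2770", appBase.toString());          // drwxrws---

    // A plain mkdir(2) under a setgid directory inherits the parent's group
    // and, on Linux, the setgid bit itself: expect an "s" in the group execute
    // slot here (e.g. drwxr-sr-x with a 022 umask).
    Path output = Files.createDirectory(appBase.resolve("output"));
    run("ls", "-ld", output.toString());

    // PosixFilePermission can only express the nine rwx bits, so applying a
    // permission this way is a chmod 0750: the setgid bit is cleared, and new
    // files created under "output" stop picking up the shared group.
    Files.setPosixFilePermissions(output, PosixFilePermissions.fromString("rwxr-x---"));
    run("ls", "-ld", output.toString());
  }
}
{noformat}

That matches the ls -l output in the report: once output has lost the bit, the attempt directory and the file.out / file.out.index files end up with the submitting user's default group (testuser) rather than the shared group, so the ShuffleHandler running as yarn has nothing it is allowed to read.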