[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-07-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13411480#comment-13411480
 ] 

Hudson commented on MAPREDUCE-3728:
---

Integrated in Hadoop-Hdfs-0.23-Build #310 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/310/])
svn merge -c 1295245 FIXES: MAPREDUCE-3728. ShuffleHandler can't access 
results when configured in a secure mode (ahmed via tucu) (Revision 1359724)

 Result = UNSTABLE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1359724
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* 
/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java


 ShuffleHandler can't access results when configured in a secure mode
 

 Key: MAPREDUCE-3728
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3728
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: mrv2, nodemanager
Affects Versions: 0.23.0
Reporter: Roman Shaposhnik
Assignee: Ahmed Radwan
Priority: Critical
 Fix For: 0.23.3, 2.0.0-alpha

 Attachments: MAPREDUCE-3728.patch


 While running the simplest of jobs (Pi) on MR2 in a fully secure 
 configuration, I noticed that the job was failing on the reduce side with 
 the following messages littering the nodemanager logs:
 {noformat}
 2012-01-19 08:35:32,544 ERROR org.apache.hadoop.mapred.ShuffleHandler: 
 Shuffle error
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
 usercache/rvs/appcache/application_1326928483038_0001/output/attempt_1326928483038_0001_m_03_0/file.out.index
  in any of the configured local directories
 {noformat}
 While digging further I found that the permissions on the files/dirs were 
 preventing the nodemanager (running as the user yarn) from accessing these files:
 {noformat}
 $ ls -l 
 /data/3/yarn/usercache/testuser/appcache/application_1327102703969_0001/output/attempt_1327102703969_0001_m_01_0
 -rw-r----- 1 testuser testuser 28 Jan 20 15:41 file.out
 -rw-r----- 1 testuser testuser 32 Jan 20 15:41 file.out.index
 {noformat}
 Digging even further revealed that the group-sticky bit that was faithfully 
 put on all the subdirectories between testuser and 
 application_1327102703969_0001 was gone from output and 
 attempt_1327102703969_0001_m_01_0. 
 Looking into how these subdirectories are created 
 (org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.initDirs())
 {noformat}
   // $x/usercache/$user/appcache/$appId/filecache
   Path appFileCacheDir = new Path(appBase, FILECACHE);
   appsFileCacheDirs[i] = appFileCacheDir.toString();
   lfs.mkdir(appFileCacheDir, null, false);
   // $x/usercache/$user/appcache/$appId/output
   lfs.mkdir(new Path(appBase, OUTPUTDIR), null, false);
 {noformat}
 Reveals that lfs.mkdir ends up manipulating permissions and thus clears the 
 sticky bit from output and filecache.
 At this point I'm at a loss about how this is supposed to work. My 
 understanding was that the whole sequence of events here was predicated on a 
 sticky bit set so that daemons running under the user yarn (default group 
 yarn) can have access to the resulting files and subdirectories down at 
 output and below. Please let me know if I'm missing something or whether 
 this is just a bug that needs to be fixed.
 On a related note, when the shuffle side of the Pi job failed, the job 
 itself didn't. It went into the endless loop and only exited when it 
 exhausted all the local storage for the log files (at which point the 
 nodemanager died and thus the job ended). Perhaps this is an even more 
 serious side effect of this issue that needs to be investigated separately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-03-01 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220041#comment-13220041
 ] 

Hudson commented on MAPREDUCE-3728:
---

Integrated in Hadoop-Mapreduce-trunk #1006 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1006/])
MAPREDUCE-3728. ShuffleHandler can't access results when configured in a 
secure mode (ahmed via tucu) (Revision 1295245)

 Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java






[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-29 Thread Ahmed Radwan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219461#comment-13219461
 ] 

Ahmed Radwan commented on MAPREDUCE-3728:
-

As I mentioned earlier, it seems that the problem is mainly in how the output 
directory is created in ContainerLocalizer.java. The setgid bit is lost in the 
process. I have even tried setting the permissions explicitly along with the 
setgid bit, but that doesn't seem to solve the issue either.

So I think the current patch, where the output directory creation is postponed, 
is appropriate, as the directory is correctly created afterwards by 
FileSystem.create(). We may also want to investigate further why mkdir in 
AbstractFileSystem is causing the loss of the setgid bit in this case, and see 
whether this is a bug or intentional behavior.
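
A minimal sketch of the idea (hypothetical code, not the attached patch), 
assuming lfs is the localizer's FileContext and using the directory names that 
appear in the paths quoted in the issue description:

{noformat}
import java.io.IOException;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class InitAppDirsSketch {
  // Directory name as it appears in the paths quoted in the description.
  private static final String FILECACHE = "filecache";

  // Hypothetical variant of the initDirs() snippet quoted in the issue
  // description: only $appId/filecache is pre-created; $appId/output is left
  // to be created later (e.g. by FileSystem.create() when the first map
  // output is written), so it inherits the group and setgid bit from $appId
  // instead of having its permissions re-set by mkdir.
  static void initAppDir(FileContext lfs, Path appBase) throws IOException {
    // $x/usercache/$user/appcache/$appId/filecache
    lfs.mkdir(new Path(appBase, FILECACHE), null, false);
    // $x/usercache/$user/appcache/$appId/output is intentionally not created here.
  }
}
{noformat}

The point is simply that the output directory ends up being created by the code 
that writes into it, inheriting group and setgid from its parent, instead of 
being pre-created by an mkdir that manipulates permissions.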





[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-29 Thread Alejandro Abdelnur (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219474#comment-13219474
 ] 

Alejandro Abdelnur commented on MAPREDUCE-3728:
---

+1. I cannot think of any side effect of this change. Please open a JIRA 
against the Local impl of the LoginContext regarding the loss of the gid.





[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-29 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219487#comment-13219487
 ] 

Hudson commented on MAPREDUCE-3728:
---

Integrated in Hadoop-Common-trunk-Commit #1811 (See 
[https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1811/])
MAPREDUCE-3728. ShuffleHandler can't access results when configured in a 
secure mode (ahmed via tucu) (Revision 1295245)

 Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java






[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-29 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219493#comment-13219493
 ] 

Hudson commented on MAPREDUCE-3728:
---

Integrated in Hadoop-Common-0.23-Commit #620 (See 
[https://builds.apache.org/job/Hadoop-Common-0.23-Commit/620/])
Merge -r 1295244:1295245 from trunk to branch. FIXES: MAPREDUCE-3728 
(Revision 1295246)

 Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295246
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* 
/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java






[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-29 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219498#comment-13219498
 ] 

Hudson commented on MAPREDUCE-3728:
---

Integrated in Hadoop-Hdfs-trunk-Commit #1886 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1886/])
MAPREDUCE-3728. ShuffleHandler can't access results when configured in a 
secure mode (ahmed via tucu) (Revision 1295245)

 Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java






[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-29 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219499#comment-13219499
 ] 

Hudson commented on MAPREDUCE-3728:
---

Integrated in Hadoop-Hdfs-0.23-Commit #609 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/609/])
Merge -r 1295244:1295245 from trunk to branch. FIXES: MAPREDUCE-3728 
(Revision 1295246)

 Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295246
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* 
/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java






[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-29 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13219561#comment-13219561
 ] 

Hudson commented on MAPREDUCE-3728:
---

Integrated in Hadoop-Mapreduce-trunk-Commit #1818 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1818/])
MAPREDUCE-3728. ShuffleHandler can't access results when configured in a 
secure mode (ahmed via tucu) (Revision 1295245)

 Result = ABORTED
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1295245
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestContainerLocalizer.java






[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-28 Thread Roman Shaposhnik (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218324#comment-13218324
 ] 

Roman Shaposhnik commented on MAPREDUCE-3728:
-

This patch seems to be working in my case. I'd recommend including it in trunk 
and the 0.23 branch.





[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-23 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214553#comment-13214553
 ] 

Hadoop QA commented on MAPREDUCE-3728:
--

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12515738/MAPREDUCE-3728.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in .

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1914//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1914//console

This message is automatically generated.





[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-02-09 Thread Roman Shaposhnik (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204799#comment-13204799
 ] 

Roman Shaposhnik commented on MAPREDUCE-3728:
-

Here's a more direct way to reproduce the problem.

{noformat}
# sudo - yarn
yarn$ mkdir -p /tmp/TEST/{logs,locs} /tmp/TEST/locs/usercache
yarn$ cp /tmp/cont1.tokens /tmp/TEST/cont1.tokens
yarn$ container-executor rvs 0 app1 /tmp/TEST/cont1.tokens /tmp/TEST/locs 
/tmp/TEST/logs /usr/java/jdk1.6.0_26/jre/bin/java -classpath 
/usr/lib/hadoop/lib/\*:/usr/lib/hadoop/\*:/etc/hadoop/conf/nm-config/log4j.properties:/etc/hadoop/conf
 -Djava.library.path=/usr/lib/hadoop/lib/native 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer
 rvs app1 cont1 0.0.0.0 4344 /tmp/TEST/locs 

main : command provided 0
main : user is rvs
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
12/02/09 11:54:40 WARN conf.Configuration: mapred-site.xml:an attempt to 
override final parameter: mapreduce.cluster.local.dir;  Ignoring.
=== Using localizerTokenSecurityInfo
12/02/09 11:54:41 INFO ipc.Client: 
Retrying connect to server: 0.0.0.0/0.0.0.0:4344. Already tried 0 time(s).
12/02/09 11:54:42 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 1 time(s).
12/02/09 11:54:43 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 2 time(s).
12/02/09 11:54:44 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 3 time(s).
12/02/09 11:54:45 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 4 time(s).
12/02/09 11:54:46 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 5 time(s).
12/02/09 11:54:47 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 6 time(s).
12/02/09 11:54:48 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 7 time(s).
12/02/09 11:54:49 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 8 time(s).
12/02/09 11:54:50 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:4344. Already tried 9 time(s).
java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:221)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:345)
Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: 
Call From c0506.hal.cloudera.com/172.29.81.158 to 0.0.0.0:4344 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see:  http://wiki.apache.org/hadoop/ConnectionRefused
at 
org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:148)
at $Proxy6.heartbeat(Unknown Source)
at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:54)
... 3 more
Caused by: java.net.ConnectException: Call From 
c0506.hal.cloudera.com/172.29.81.158 to 0.0.0.0:4344 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:686)
at org.apache.hadoop.ipc.Client.call(Client.java:1141)
at org.apache.hadoop.ipc.Client.call(Client.java:1100)
at 
org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:145)
... 5 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at ...
{noformat}

[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-01-25 Thread Vinod Kumar Vavilapalli (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193358#comment-13193358
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-3728:


bq. Reveals that lfs.mkdir ends up manipulating permissions and thus clears 
sticky bit from output and filecache.
Maybe I am missing something, but I don't understand this. The output-dir is 
being created with null FilePermissions, which becomes 0777.

Instead I suspect that the umask for your testuser is strict. That may be the 
reason why the files are getting created with 640. Can you please check?

Also, just to be sure, are you using DefaultContainerExecutor or 
LinuxContainerExecutor? It is most likely the latter, but just confirming.

bq. My understanding was that the whole sequence of events here was predicated 
on a sticky bit set so that daemons running under the user yarn (default group 
yarn) can have access to the resulting files and subdirectories down at output 
and below.
Yes, that is the case with YARN and even mrv1/1.0.

bq. On a related note, when the shuffle side of the Pi job failed the job 
itself didn't. It went into the endless loop [..]
Yes, this is a known issue MAPREDUCE-3418.
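
To make the umask arithmetic above concrete: the effective mode is the 
requested mode masked by the complement of the umask, so a strict umask such 
as 027 would indeed turn a 0666 file-create request into the 640 seen in the 
listing. A tiny illustrative check (the 027 value is hypothetical, not taken 
from the cluster configuration):

{noformat}
public class UmaskDemo {
  public static void main(String[] args) {
    int umask = 0027;                            // hypothetical strict umask
    System.out.printf("%04o%n", 0666 & ~umask);  // files:       0640
    System.out.printf("%04o%n", 0777 & ~umask);  // directories: 0750
  }
}
{noformat}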





[jira] [Commented] (MAPREDUCE-3728) ShuffleHandler can't access results when configured in a secure mode

2012-01-25 Thread Roman Shaposhnik (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193377#comment-13193377
 ] 

Roman Shaposhnik commented on MAPREDUCE-3728:
-

bq. Maybe I am missing something, but I don't understand this. The output-dir 
is being created with null FilePermissions, which becomes 0777.

I suspect this is not about *what* permissions it is created with, but the 
sheer fact that permissions are set at all: whenever permissions are set 
explicitly, the sticky bit gets lost. Here's a trivial example. As you can 
see, when a subdir is created it at first retains the sticky bit, but when I 
set permissions explicitly it gets cleared (as it should):

{noformat}
rvs@ahmed-laptop:/tmp$ cd /tmp
rvs@ahmed-laptop:/tmp$ mkdir sticky.dir
rvs@ahmed-laptop:/tmp$ sudo chgrp root sticky.dir 
rvs@ahmed-laptop:/tmp$ sudo chmod g+s sticky.dir
rvs@ahmed-laptop:/tmp$ mkdir sticky.dir/subdir
rvs@ahmed-laptop:/tmp$ ls -l sticky.dir
total 4
drwxr-sr-x 2 rvs root 4096 2012-01-25 13:50 subdir
rvs@ahmed-laptop:/tmp$ chmod 777 sticky.dir/subdir
rvs@ahmed-laptop:/tmp$ ls -l sticky.dir
total 4
drwxrwxrwx 2 rvs root 4096 2012-01-25 13:50 subdir
{noformat}
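
The same behaviour reproduces from plain Java via java.nio.file, which supports 
the theory that it is the explicit permission call (effectively a chmod of just 
the nine rwx bits) that drops the inherited setgid bit. A rough sketch, 
assuming /tmp/sticky.dir has been prepared exactly as in the shell session 
above:

{noformat}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;

public class SetgidDemo {
  public static void main(String[] args) throws IOException {
    // Assumes /tmp/sticky.dir already exists with the setgid bit set, as in
    // the chgrp/chmod g+s steps shown above.
    Path parent = Paths.get("/tmp/sticky.dir");

    // A plain mkdir inherits the parent's group and keeps the setgid bit
    // (ls shows "drwxr-sr-x", as in the shell session above).
    Path subdir = Files.createDirectory(parent.resolve("subdir"));

    // Explicitly applying permissions issues a chmod carrying only the nine
    // rwx bits, so the inherited setgid bit is cleared ("drwxrwxrwx").
    Files.setPosixFilePermissions(subdir,
        PosixFilePermissions.fromString("rwxrwxrwx"));
  }
}
{noformat}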

bq. Instead I suspect that the umask for your testuser is strict. That may be 
the reason why the files are getting created with 640. Can you please check?

The problem is not file permissions per se, but the fact that the subdirs 
(starting from output) have lost the group sticky bit, so everything created 
underneath them gets the user's group by default (and of course no longer 
retains the group sticky bit on the subdirectories). But anyway, I've checked, 
and the umask for testuser is 0022.

bq. Also, just to be sure, are you using DefaultContainerExecutor or 
LinuxContainerExecutor? It is most likely the latter, but just confirming.

This is LinuxContainerExecutor since, to the best of my knowledge, one has to 
configure it for a fully secure cluster.
