Hey Parth, I don't think it's possible to run MR with the filesystem based entirely on S3. You can use S3 for your job's file I/O, but your fs.default.name (or fs.defaultFS) must be a file:/// or hdfs:// filesystem. That way the MR framework can distribute and run its own files properly, while still being able to process S3 URLs passed as input or output locations.
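For example, something along these lines ought to work (a rough sketch only; the bucket name, credential placeholders and driver class below are made up for illustration, not taken from your setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3nIoDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS stays on HDFS (as set in core-site.xml), so the staging
    // and system directories keep real ownership/permission semantics.

    // s3n credentials so the s3n:// paths below can be resolved (placeholders).
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    Job job = Job.getInstance(conf, "s3n-io-example");
    job.setJarByClass(S3nIoDriver.class);
    // set your Mapper/Reducer and output key/value classes here as usual

    // S3 is used only for the job's input and output locations.
    FileInputFormat.addInputPath(job, new Path("s3n://your-bucket/input/"));
    FileOutputFormat.setOutputPath(job, new Path("s3n://your-bucket/output/"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}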
On Tue, Oct 23, 2012 at 11:02 PM, Parth Savani <pa...@sensenetworks.com> wrote:
> Hello Everyone,
> I am trying to run a hadoop job with s3n as my filesystem.
> I changed the following properties in my hdfs-site.xml:
>
> fs.default.name=s3n://KEY:VALUE@bucket/
> mapreduce.jobtracker.staging.root.dir=s3n://KEY:VALUE@bucket/tmp
>
> When I run the job from EC2, I get the following error:
>
> The ownership on the staging directory s3n://KEY:VALUE@bucket/tmp/ec2-user/.staging is not as expected. It is owned by The directory must be owned by the submitter ec2-user or by ec2-user
>     at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:113)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
>     at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:844)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:844)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:481)
>
> I am using the Cloudera CDH4 Hadoop distribution. The error is thrown from the JobSubmissionFiles.java class:
>
> public static Path getStagingDir(JobClient client, Configuration conf)
>     throws IOException, InterruptedException {
>   Path stagingArea = client.getStagingAreaDir();
>   FileSystem fs = stagingArea.getFileSystem(conf);
>   String realUser;
>   String currentUser;
>   UserGroupInformation ugi = UserGroupInformation.getLoginUser();
>   realUser = ugi.getShortUserName();
>   currentUser = UserGroupInformation.getCurrentUser().getShortUserName();
>   if (fs.exists(stagingArea)) {
>     FileStatus fsStatus = fs.getFileStatus(stagingArea);
>     String owner = fsStatus.getOwner();
>     if (!(owner.equals(currentUser) || owner.equals(realUser))) {
>       throw new IOException("The ownership on the staging directory " +
>           stagingArea + " is not as expected. " +
>           "It is owned by " + owner + ". The directory must " +
>           "be owned by the submitter " + currentUser + " or " +
>           "by " + realUser);
>     }
>     if (!fsStatus.getPermission().equals(JOB_DIR_PERMISSION)) {
>       LOG.info("Permissions on staging directory " + stagingArea + " are " +
>           "incorrect: " + fsStatus.getPermission() + ". Fixing permissions " +
>           "to correct value " + JOB_DIR_PERMISSION);
>       fs.setPermission(stagingArea, JOB_DIR_PERMISSION);
>     }
>   } else {
>     fs.mkdirs(stagingArea, new FsPermission(JOB_DIR_PERMISSION));
>   }
>   return stagingArea;
> }
>
> I think my job calls getOwner(), which returns null since S3 does not have file permissions, and that results in the IOException I am getting.
>
> Any workaround for this? Any idea how I could use S3 as the filesystem with Hadoop in distributed mode?

--
Harsh J