On 23/10/12 13:32, Parth Savani wrote:
Hello Everyone,
        I am trying to run a hadoop job with s3n as my filesystem.
I changed the following properties in my hdfs-site.xml

fs.default.name=s3n://KEY:VALUE@bucket/
A good practice is to set these two properties in core-site.xml if you will use S3 often:
<property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>AWS_ACCESS_KEY_ID</value>
</property>

<property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>AWS_SECRET_ACCESS_KEY</value>
</property>
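Note that the fs.s3.* keys above are the ones read for s3:// (block store) URIs; as far as I know, when you go through s3n:// Hadoop looks at the s3n-prefixed equivalents, so it is probably worth adding those as well (same placeholder values):

<property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>AWS_ACCESS_KEY_ID</value>
</property>

<property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>AWS_SECRET_ACCESS_KEY</value>
</property>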

After that, you can access your URIs in a friendlier way:
S3:
 s3://<s3-bucket>/<s3-filepath>

S3n:
 s3n://<s3-bucket>/<s3-filepath>
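
For example, once the keys are in core-site.xml you should be able to list a bucket straight from the command line (bucket and path are placeholders):

 hadoop fs -ls s3n://<s3-bucket>/<s3-filepath>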

mapreduce.jobtracker.staging.root.dir=s3n://KEY:VALUE@bucket/tmp

When I run the job from EC2, I get the following error:

The ownership on the staging directory s3n://KEY:VALUE@bucket/tmp/ec2-user/.staging is not as expected. It is owned by . The directory must be owned by the submitter ec2-user or by ec2-user
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:113)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:844)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:844)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:481)

I am using the Cloudera CDH4 Hadoop distribution. The error is thrown from the JobSubmissionFiles.java class:
  public static Path getStagingDir(JobClient client, Configuration conf)
      throws IOException, InterruptedException {
    Path stagingArea = client.getStagingAreaDir();
    FileSystem fs = stagingArea.getFileSystem(conf);
    String realUser;
    String currentUser;
    UserGroupInformation ugi = UserGroupInformation.getLoginUser();
    realUser = ugi.getShortUserName();
    currentUser = UserGroupInformation.getCurrentUser().getShortUserName();
    if (fs.exists(stagingArea)) {
      FileStatus fsStatus = fs.getFileStatus(stagingArea);
      String owner = fsStatus.getOwner();
      if (!(owner.equals(currentUser) || owner.equals(realUser))) {
        throw new IOException("The ownership on the staging directory " +
            stagingArea + " is not as expected. " +
            "It is owned by " + owner + ". The directory must " +
            "be owned by the submitter " + currentUser + " or " +
            "by " + realUser);
      }
      if (!fsStatus.getPermission().equals(JOB_DIR_PERMISSION)) {
        LOG.info("Permissions on staging directory " + stagingArea + " are " +
            "incorrect: " + fsStatus.getPermission() + ". Fixing permissions " +
            "to correct value " + JOB_DIR_PERMISSION);
        fs.setPermission(stagingArea, JOB_DIR_PERMISSION);
      }
    } else {
      fs.mkdirs(stagingArea,
          new FsPermission(JOB_DIR_PERMISSION));
    }
    return stagingArea;
  }


I think my job calls getOwner(), which returns NULL since S3 does not have file permissions, and that results in the IOException I am getting.
Which user are you launching the job as in EC2?
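
If you want to confirm what s3n actually reports there, you can check the owner of that staging path directly (just a sketch; the bucket and path are placeholders, and it assumes the fs.s3n.* keys are in core-site.xml):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckStagingOwner {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Open the native S3 filesystem for the bucket (placeholder name).
    FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
    // Ask for the status of the staging directory the JobClient complained about.
    FileStatus status = fs.getFileStatus(new Path("s3n://my-bucket/tmp/ec2-user/.staging"));
    // On s3n the reported owner is typically empty, which is why the ownership check fails.
    System.out.println("owner='" + status.getOwner() + "'");
  }
}

If the printed owner is empty (or not ec2-user), that matches the check in getStagingDir() above.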



Any workaround for this? Any idea how I could use S3 as the filesystem with Hadoop in distributed mode?

Look here:
http://wiki.apache.org/hadoop/AmazonS3
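
One approach that usually avoids this (just a sketch, not CDH4-specific; the namenode host/port, jar and class names are placeholders): keep fs.default.name pointing at HDFS, leave mapreduce.jobtracker.staging.root.dir at its default, and use s3n:// URIs only for the job input and output paths. That way the staging directory lives on HDFS, where ownership works as the JobClient expects:

<property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>
</property>

Then, assuming your driver takes the input and output paths as arguments:

 hadoop jar my-job.jar MyJobClass s3n://<s3-bucket>/input s3n://<s3-bucket>/output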



