On 23/10/12 13:32, Parth Savani wrote:
Hello Everyone,
I am trying to run a Hadoop job with s3n as my filesystem.
I changed the following property in my hdfs-site.xml:
fs.default.name=s3n://KEY:VALUE@bucket/
A good practice is to set these two properties in core-site.xml if you
will be using S3 often:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>AWS_ACCESS_KEY_ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>AWS_SECRET_ACCESS_KEY</value>
</property>
After that, you can access your URIs in a friendlier way:
S3:
s3://<s3-bucket>/<s3-filepath>
S3n:
s3n://<s3-bucket>/<s3-filepath>
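If it helps, here is a rough, untested Java sketch of the same idea: the
credentials go into the Configuration rather than the URI (note that for the
s3n:// scheme the analogous property names are fs.s3n.awsAccessKeyId and
fs.s3n.awsSecretAccessKey), and the path then only needs bucket and key. The
bucket and file names below are placeholders.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nAccessExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same values you would put in core-site.xml; placeholders here.
    conf.set("fs.s3n.awsAccessKeyId", "AWS_ACCESS_KEY_ID");
    conf.set("fs.s3n.awsSecretAccessKey", "AWS_SECRET_ACCESS_KEY");

    // With the keys in the configuration, the URI no longer needs KEY:SECRET@.
    FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
    System.out.println(fs.exists(new Path("s3n://my-bucket/some/file")));
  }
}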
mapreduce.jobtracker.staging.root.dir=s3n://KEY:VALUE@bucket/tmp
When I run the job from EC2, I get the following error:
The ownership on the staging directory
s3n://KEY:VALUE@bucket/tmp/ec2-user/.staging is not as expected. It is
owned by The directory must be owned by the submitter ec2-user or by
ec2-user
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:113)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:844)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:844)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:481)
I am using the Cloudera CDH4 Hadoop distribution. The error is thrown from
the JobSubmissionFiles.java class:
public static Path getStagingDir(JobClient client, Configuration conf)
    throws IOException, InterruptedException {
  Path stagingArea = client.getStagingAreaDir();
  FileSystem fs = stagingArea.getFileSystem(conf);
  String realUser;
  String currentUser;
  UserGroupInformation ugi = UserGroupInformation.getLoginUser();
  realUser = ugi.getShortUserName();
  currentUser = UserGroupInformation.getCurrentUser().getShortUserName();
  if (fs.exists(stagingArea)) {
    FileStatus fsStatus = fs.getFileStatus(stagingArea);
    String owner = fsStatus.getOwner();
    if (!(owner.equals(currentUser) || owner.equals(realUser))) {
      throw new IOException("The ownership on the staging directory " +
          stagingArea + " is not as expected. " +
          "It is owned by " + owner + ". The directory must " +
          "be owned by the submitter " + currentUser + " or " +
          "by " + realUser);
    }
    if (!fsStatus.getPermission().equals(JOB_DIR_PERMISSION)) {
      LOG.info("Permissions on staging directory " + stagingArea + " are " +
          "incorrect: " + fsStatus.getPermission() + ". Fixing permissions " +
          "to correct value " + JOB_DIR_PERMISSION);
      fs.setPermission(stagingArea, JOB_DIR_PERMISSION);
    }
  } else {
    fs.mkdirs(stagingArea, new FsPermission(JOB_DIR_PERMISSION));
  }
  return stagingArea;
}
I think my job calls getOwner(), which returns NULL since S3 does not have
file permissions, and that results in the IOException I am getting.
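One way to check that hypothesis is to ask the s3n FileSystem directly what it
reports as the owner of the staging directory. This is only a quick sketch; the
bucket and path below are placeholders for your own staging location.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckS3nOwner {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml (with the AWS keys) from the classpath.
    Configuration conf = new Configuration();
    Path staging = new Path("s3n://my-bucket/tmp/ec2-user/.staging"); // placeholder
    FileSystem fs = staging.getFileSystem(conf);
    if (fs.exists(staging)) {
      FileStatus status = fs.getFileStatus(staging);
      // If this prints an empty owner, the ownership check in
      // JobSubmissionFiles.getStagingDir() can never match the submitting user.
      System.out.println("owner = '" + status.getOwner() + "'");
      System.out.println("permission = " + status.getPermission());
    } else {
      System.out.println("staging dir does not exist yet");
    }
  }
}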
With which user are you launching the job in EC2?
Any workaround for this? Any idea how I could use S3 as the filesystem
with Hadoop in distributed mode?
Look here:
http://wiki.apache.org/hadoop/AmazonS3
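One workaround that often sidesteps this check is to leave fs.default.name and
the job staging directory on HDFS and only read and write the job data from
s3n:// paths. The sketch below assumes that setup; it is untested, the bucket
and paths are placeholders, and the mapper/reducer configuration is omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3nInputOutputJob {
  public static void main(String[] args) throws Exception {
    // fs.default.name stays on hdfs://, so .staging lives on HDFS.
    Configuration conf = new Configuration();
    Job job = new Job(conf, "s3n input/output example");
    job.setJarByClass(S3nInputOutputJob.class);
    // Only the job data lives on S3; mapper/reducer/output types omitted here.
    FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input"));
    FileOutputFormat.setOutputPath(job, new Path("s3n://my-bucket/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}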