Hi Stephen,
It looks to me like you are on the right track. The original kMeans code
and job patterns were written over a year ago, probably on Hadoop 0.10 or
0.11, IIRC. The Hadoop developers have made significant changes to the
file system in the interim, and nobody - except you - has tried to run
kMeans on EMR.
The logic about using the incorrect file system method is sound, and
your fix seems like it should work. I don't expect the hadoop version
differences to impact you since kMeans has not been updated recently to
take advantage of hadoop improvements.
It certainly seems like dfs.exists(outPath) should return false if
outPath does not exist. You have a sharp machete and are making good progress
breaking a jungle trail to EMR. If you'd like to chat on the phone or
Skype, please contact me directly (jeff at windwardsolutions dot com).
Jeff
Stephen Green wrote:
A bit more progress. I asked about this problem on Amazon's EMR
forums. Here's the thread:
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945
The answer from Amazon was:
This appears to be an issue with Mahout. This exception is fairly
common and matches the pattern of "Wrong FS: s3n://*/, expected:
hdfs://*:9000". This occurs when you try and use an S3N path with
HDFS. Typically this occurs because the code asks for the wrong
FileSystem.
This could happen because a developer used the wrong static method on
Hadoop's FileSystem class:
http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html
If you call FileSystem.get(Configuration conf) you'll get an instance
of the cluster's default file system, which in our case is HDFS.
Instead, if you have a URI and want a reference to the FileSystem
that URI points to, you should call the method FileSystem.get(URI
uri, Configuration conf).
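In other words (just to make sure I follow), the difference comes down to
something like this, where outPath stands in for the job's output Path and
conf for its configuration:
// Always returns the cluster's default file system (HDFS on EMR), so handing
// it an s3n:// path later trips the "Wrong FS" check.
FileSystem defaultFs = FileSystem.get(conf);
// Returns the file system that owns the path's scheme, e.g. a
// NativeS3FileSystem for an s3n:// URI.
FileSystem outFs = FileSystem.get(outPath.toUri(), conf);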
He offered a solution that involved using DistCp to copy data from S3
to HDFS and then back again, but since I have the Mahout source, I
decided to pursue things a bit further. I went into the source and
modified the places where the filesystem is fetched to do the following:
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
(There were three places where I changed it, but I expect there are more
lying around.) This is the idiom used by the CloudBurst example on EMR.
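For reference, the guarded cleanup in kmeans.Job now looks roughly like this
(paraphrasing rather than pasting the exact source; output is the job's
output argument, conf is its JobConf, and I'm using the two-argument
recursive delete):
Path outPath = new Path(output);
// Resolve the file system from the path's own URI instead of taking the
// cluster default, so an s3n:// output gets a NativeS3FileSystem.
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
// Clear out any previous output before the run starts.
if (dfs.exists(outPath)) {
  dfs.delete(outPath, true);
}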
Making this change fixes the exception that I was getting, but I'm now
getting a different exception:
java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
(The line numbers in kmeans.Job are weird because I added logging.)
If the Hadoop on EMR really is 0.18.3, then the null pointer here is
the store field in NativeS3FileSystem. But there's another problem: I
deleted the output path before I started the run, so the existence
check should have returned false and dfs.delete should never have been
called. I added a bit of logging to the KMeans job, and here's what it
says about the output path:
2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem
So it got the right output file system type.
2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true
Shouldn't dfs.exists(outPath) return false for a non-existent path? And
didn't the store have to exist (i.e., be non-null) for it to figure
that out? I guess this really is starting to verge into base Hadoop
territory.
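If I have to, one way to take Mahout out of the picture entirely would be a
tiny standalone check against the same bucket, something along these lines
(just a sketch; the key name is made up, and the s3n credentials are assumed
to already be set in the Hadoop configuration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nExistsCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical path in the same bucket; it was never created.
    Path outPath = new Path("s3n://mahout-output/never-created");
    Configuration conf = new Configuration();
    // Same idiom as the fix: resolve the file system from the path's URI.
    FileSystem fs = FileSystem.get(outPath.toUri(), conf);
    System.out.println("fs class: " + fs.getClass().getName());
    System.out.println("exists:   " + fs.exists(outPath));
    // If exists() comes back true for a path that was never created, the
    // surprise is in the S3N layer rather than in the kMeans job.
  }
}
Running that on the master node should at least show whether exists()
behaves the same outside of the Mahout job.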
I'm rapidly getting to the point where I need to solve this one just
to prove to myself that I can get it to run!
Steve