Hi Stephen,

It looks to me like you are on the right track. The original kMeans code and job patterns were written over a year ago, probably against Hadoop 0.10 or 0.11, IIRC. The Hadoop file system APIs have changed significantly in the interim, and nobody except you has tried to run kMeans on EMR.

Your reasoning about the incorrect file system method is sound, and your fix looks like it should work. I don't expect the Hadoop version differences to affect you, since kMeans has not been updated recently to take advantage of newer Hadoop features.

It certainly seems like dfs.exists(outPath) should return false if outPath does not exist. You have a sharp machete and are making good progress breaking a jungle trail to EMR. If you'd like to chat on the phone or Skype, please contact me directly (jeff at windwardsolutions dot com).

Jeff


Stephen Green wrote:
A bit more progress. I asked about this problem on Amazon's EMR forums. Here's the thread:

http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945

The answer from Amazon was:

This appears to be an issue with Mahout. This exception is fairly common and matches the pattern of "Wrong FS: s3n://*/, expected: hdfs://*:9000". It occurs when you try to use an S3N path with HDFS, typically because the code asks for the wrong FileSystem.

This could happen because a developer used the wrong static method on Hadoop's FileSystem class:

http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html

If you call FileSystem.get(Configuration conf) you'll get an instance of the cluster's default file system, which in our case is HDFS. Instead, if you have a URI and want a reference to the FileSystem that URI points to, you should call the method FileSystem.get(URI uri, Configuration conf).
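
To make the distinction concrete, here is a quick sketch (the bucket name and class wrapper are made up, not anything from Mahout):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhichFs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path outPath = new Path("s3n://some-bucket/output");

        // The cluster's default file system (HDFS on EMR), regardless
        // of outPath's scheme; handing this instance an s3n:// path is
        // what produces the "Wrong FS" exception above.
        FileSystem defaultFs = FileSystem.get(conf);

        // The file system backing the URI's scheme (NativeS3FileSystem
        // for s3n://); this is the lookup the code should use.
        FileSystem outFs = FileSystem.get(outPath.toUri(), conf);

        System.out.println(defaultFs.getClass() + " vs " + outFs.getClass());
      }
    }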


He offered a workaround that involved using DistCp to copy the data from S3 to HDFS and back again, but since I have the Mahout source, I decided to pursue things a bit further. I went into the source and modified the places where the file system is fetched to do the following:

    FileSystem dfs = FileSystem.get(outPath.toUri(), conf);

(There were three places where I changed it, but I expect there are more lying around.) This is the idiom used by the CloudBurst example on EMR.
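
For reference, the surrounding pattern ends up looking something like this (a sketch with my own method shape and names, not the exact Mahout source):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputCleanup {
      // Resolve the file system from the output URI itself, so an
      // s3n:// path yields a NativeS3FileSystem rather than HDFS.
      static void clearOutput(String output, Configuration conf) throws IOException {
        Path outPath = new Path(output);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        // Clear stale output from an earlier run before the job starts.
        if (dfs.exists(outPath)) {
          dfs.delete(outPath, true); // true = recursive
        }
      }
    }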

Making this change fixes the exception that I was getting, but I'm now getting a different exception:

java.lang.NullPointerException
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
        at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

(The line numbers in kmeans.Job are weird because I added logging.)

If the Hadoop on EMR really is 0.18.3, then the null pointer here is the store field in NativeS3FileSystem. But there's another problem: I deleted the output path before I started the run, so the existence check should have returned false and dfs.delete should never have been called. To check, I added a bit of logging to the KMeans job.
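
The added lines were roughly these (reconstructed from the output below, not the exact code; LOG is the Job class's logger):

    LOG.info("dfs: " + dfs.getClass());
    LOG.info(outPath + " exists: " + dfs.exists(outPath));

Here's what they print for the output path: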

2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem

So it got the right output file system type.

2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true

Shouldn't dfs.exists(outPath) return false for a non-existent path? And didn't the store have to exist (i.e., be non-null) for it to figure that out? I guess this is really starting to verge into core Hadoop territory.

I'm rapidly getting to the point where I need to solve this one just to prove to myself that I can get it to run!

Steve
