Hi Stephen,
It looks to me like you are on the right track. The original kMeans code
and job patterns were written over a year ago, probably on Hadoop 0.10 or
0.11, IIRC. The Hadoop developers have made significant changes to the
file system in the interim, and nobody - except you - has tried to run
kMeans on EMR.
The logic about using the incorrect file system method is sound, and
your fix seems like it should work. I don't expect the hadoop version
differences to impact you since kMeans has not been updated recently to
take advantage of hadoop improvements.
It certainly seems like dfs.exists(outPath) should return false if
outPath does not exist. You have a sharp machete and are making good progress
breaking a jungle trail to EMR. If you'd like to chat on the phone or
Skype, please contact me directly (jeff at windwardsolutions dot com).
Jeff
Stephen Green wrote:
A bit more progress. I asked about this problem on Amazon's EMR
forums. Here's the thread:
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=30945
The answer from Amazon was:
This appears to be an issue with Mahout. This exception is fairly
common and matches the pattern of "Wrong FS: s3n://*/, expected:
hdfs://*:9000". This occurs when you try and use an S3N path with
HDFS. Typically this occurs because the code asks for the wrong
FileSystem.
This could happen because a developer used the wrong static method on
Hadoop's FileSystem class:
http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/fs/FileSystem.html
If you call FileSystem.get(Configuration conf) you'll get an instance
of the cluster's default file system, which in our case is HDFS.
Instead, if you have a URI and want a reference to the FileSystem
that URI points to, you should call the method FileSystem.get(URI
uri, Configuration conf).
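In other words (just to make sure I follow), the difference comes down to
something like this, where outPath stands in for the job's output Path and
conf for its configuration:
// Always returns the cluster's default file system (HDFS on EMR), so handing
// it an s3n:// path later trips the "Wrong FS" check.
FileSystem defaultFs = FileSystem.get(conf);
// Returns the file system that owns the path's scheme, e.g. a
// NativeS3FileSystem for an s3n:// URI.
FileSystem outFs = FileSystem.get(outPath.toUri(), conf);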
He offered a solution that involved using DistCp to copy data from S3
to HDFS and then back again, but since I have the Mahout source, I
decided to pursue things a bit further. I went into the source and
modified the places where the filesystem is fetched to do the following:
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
(There were three places where I changed it, but I expect there are more
lying around.) This is the idiom used by the CloudBurst example on EMR.
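For reference, the guarded cleanup in kmeans.Job now looks roughly like this
(paraphrasing rather than pasting the exact source; output is the job's
output argument, conf is its JobConf, and I'm using the two-argument
recursive delete):
Path outPath = new Path(output);
// Resolve the file system from the path's own URI instead of taking the
// cluster default, so an s3n:// output gets a NativeS3FileSystem.
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
// Clear out any previous output before the run starts.
if (dfs.exists(outPath)) {
  dfs.delete(outPath, true);
}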
Making this change fixes the exception that I was getting, but I'm now
getting a different exception:
java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.delete(NativeS3FileSystem.java:310)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
(The line numbers in kmeans.Job are weird because I added logging.)
If the Hadoop on EMR really is 0.18.3, then the null pointer here is
the store field in NativeS3FileSystem. But there's another problem: I
deleted the output path before I started the run, so the existence
check should have returned false and dfs.delete should never have been
called. I added a bit of logging to the KMeans job, and here's what it
says about the output path:
2009-04-16 14:04:35,757 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): dfs: class org.apache.hadoop.fs.s3native.NativeS3FileSystem
So it got the right output file system type.
2009-04-16 14:04:35,758 INFO org.apache.mahout.clustering.syntheticcontrol.kmeans.Job (main): s3n://mahout-output/ exists: true
Shouldn't dfs.exists(outPath) return false for a non-existent path? And
didn't the store have to exist (i.e., be non-null) for it to figure
that out? I guess this really is starting to verge into base Hadoop
territory.
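If I have to, one way to take Mahout out of the picture entirely would be a
tiny standalone check against the same bucket, something along these lines
(just a sketch; the key name is made up, and the s3n credentials are assumed
to already be set in the Hadoop configuration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nExistsCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical path in the same bucket; it was never created.
    Path outPath = new Path("s3n://mahout-output/never-created");
    Configuration conf = new Configuration();
    // Same idiom as the fix: resolve the file system from the path's URI.
    FileSystem fs = FileSystem.get(outPath.toUri(), conf);
    System.out.println("fs class: " + fs.getClass().getName());
    System.out.println("exists:   " + fs.exists(outPath));
    // If exists() comes back true for a path that was never created, the
    // surprise is in the S3N layer rather than in the kMeans job.
  }
}
Running that on the master node should at least show whether exists()
behaves the same outside of the Mahout job.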
I'm rapidly getting to the point where I need to solve this one just
to prove to myself that I can get it to run!
Steve