Hadoop should be able to read directly from S3, I believe: http://wiki.apache.org/hadoop/AmazonS3
Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Sean Owen <[email protected]>
> To: [email protected]
> Sent: Tuesday, April 14, 2009 4:19:51 PM
> Subject: Re: Mahout on Elastic MapReduce
>
> This is a fairly uninformed observation, but: the error seems to be from
> Hadoop. It seems to say that it understands hdfs:, but not s3n:, and that
> makes sense to me. Do we expect Hadoop to understand how to read from S3?
> I would expect not. (Though you point to examples that seem to overcome
> this just fine?)
>
> When I have integrated code with stuff stored on S3, I have always had to
> write extra glue code to copy from S3 to a local file system, do the
> work, then copy back.
>
> On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green wrote:
> >
> > On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
> >
> >> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
> >>
> >>> Hi Stephen,
> >>>
> >>> You are out on the bleeding edge with EMR.
> >>
> >> Yeah, but the view is lovely from here!
> >>
> >>> I've been able to run the kmeans example directly on a small EC2
> >>> cluster that I started up myself (using the Hadoop src/contrib/ec2
> >>> scripts). I have not yet tried EMR (just got an account yesterday),
> >>> but I see that it requires you to have your data in S3 as opposed to
> >>> HDFS.
> >>>
> >>> The job first runs the InputDriver to copy the raw test data into the
> >>> Mahout Vector external representation after deleting any pre-existing
> >>> output files. It looks to me like the two delete() snippets you show
> >>> are pretty much equivalent. If you have no pre-existing output
> >>> directory, the Mahout snippet won't attempt to delete it.
> >>
> >> I managed to figure that out :-) I'm pretty comfortable with the ideas
> >> behind MapReduce, but being confronted with my first Job is a bit more
> >> daunting than I expected.
> >>
> >>> I too am at a loss to explain what you are seeing. If you can post
> >>> more results I can try to help you read the tea leaves...
> >>
> >> I noticed that the CloudBurst job just deleted the directory without
> >> checking for existence, so I tried the same thing with Mahout:
> >>
> >> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> >> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
> >>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> >>   at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> >>   at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> >>   at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
> >>   at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
> >>   at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
> >>
> >> So no joy there.
> >>
> >> Should I see if I can isolate this as an s3n problem? I suppose I could
> >> try running the Hadoop job locally with it reading and writing the data
> >> from S3 and see if it suffers from the same problem. At least then I
> >> could debug inside Hadoop.
> >>
> >> Of course, I'm doing all this on Hadoop 0.18.3, and if it is an s3n
> >> problem it might have been fixed already. That doesn't help much
> >> running on EMR, I guess.
> >>
> >> I'm also going to start a run on EMR that does away with the whole
> >> exists/delete check and see if that works.
> >
> > Following up to myself (my wife will tell you that I talk to myself!),
> > I removed a number of the exists/delete checks: in CanopyClusteringJob,
> > CanopyDriver, KMeansDriver, and ClusterDriver. This allowed the jobs to
> > progress, but they died the death a little later with the following
> > exception (and a few more; I can send the whole log if you like):
> >
> > java.lang.IllegalArgumentException: Wrong FS:
> > s3n://mahoutput/canopies/part-00000, expected:
> > hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
> >   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> >   at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> >   at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> >   at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
> >   at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
> >   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
> >   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
> >   at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
> >   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> >   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> >   at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
> >   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> >   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
> >   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> >
> > Looking at the exception message there, I would almost swear that it
> > thinks the whole s3n path is the name of an FS that it doesn't know
> > about, but that might just be a bad message. This message repeats a few
> > times (retrying failed mappers, I guess?) and then the job fails.
> >
> > One thing that occurred to me: the Mahout examples job has the Hadoop
> > 0.19.1 core jar in it. Could I be seeing some kind of version skew
> > between the Hadoop in the job file and the one on EMR? Although it
> > worked fine with a local 0.18.3, so maybe not.
> >
> > I'm going to see if I can get the stock Mahout to run with s3n inputs
> > and outputs tomorrow, and I'll let you all know how that goes.
> >
> > Steve
> > --
> > Stephen Green          // [email protected]
> > Principal Investigator \\ http://blogs.sun.com/searchguy
> > Aura Project           // Voice: +1 781-442-0926
> > Sun Microsystems Labs  \\ Fax: +1 781-442-1692
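Both stack traces in the thread point at the same root cause: code that obtains a FileSystem handle for the cluster's default file system (HDFS) and then hands it an s3n:// path, which FileSystem.checkPath() rejects with the "Wrong FS" error. Below is a minimal sketch of the distinction against the 0.18-era Hadoop API; it is not the actual Mahout driver code, and the class name and bucket are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputCleaner {  // hypothetical class, for illustration only
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("s3n://mahout-output");

        // The failing pattern: FileSystem.get(conf) returns the *default*
        // file system (hdfs://... on an EC2/EMR cluster), and passing it an
        // s3n:// path trips FileSystem.checkPath() -- the "Wrong FS" error.
        // FileSystem fs = FileSystem.get(conf);

        // Asking the Path for its own file system instead returns the
        // NativeS3FileSystem bound to the s3n: scheme.
        FileSystem fs = output.getFileSystem(conf);
        if (fs.exists(output)) {
          fs.delete(output, true);  // recursive delete of any old output
        }
      }
    }

The same reasoning would apply to the ClusterMapper trace above: the SequenceFile.Reader there is presumably being constructed with an HDFS FileSystem alongside an s3n path.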

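For reference, the wiki page linked at the top of the thread describes Hadoop's native S3 support. A sketch of addressing a bucket directly through the s3n: scheme, assuming a 0.18-era API with credentials supplied via the standard fs.s3n.* configuration properties (the bucket, paths, and class name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3nSmokeTest {  // hypothetical class, for illustration only
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // These can also live in hadoop-site.xml; the values are placeholders.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        Path input = new Path("s3n://mahout-input/testdata");
        FileSystem fs = input.getFileSystem(conf);  // NativeS3FileSystem

        // List the input as a quick sanity check that s3n is wired up.
        FileStatus[] statuses = fs.listStatus(input);
        if (statuses != null) {
          for (FileStatus status : statuses) {
            System.out.println(status.getPath());
          }
        }
      }
    }

If something like this works locally but the Mahout drivers still fail on EMR, that would point back at the FileSystem.get(conf) pattern sketched above rather than at s3n itself.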