Hadoop should be able to read directly from S3, I believe: http://wiki.apache.org/hadoop/AmazonS3
Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Sean Owen <[email protected]>
> To: [email protected]
> Sent: Tuesday, April 14, 2009 4:19:51 PM
> Subject: Re: Mahout on Elastic MapReduce
>
> This is a fairly uninformed observation, but: the error seems to be from
> Hadoop. It seems to say that it understands hdfs:, but not s3n:, and that
> makes sense to me. Do we expect Hadoop to understand how to read from S3?
> I would expect not. (Though you point to examples that seem to overcome
> this just fine?)
>
> When I have integrated code with stuff stored on S3, I have always had to
> write extra glue code to copy from S3 to a local file system, do the
> work, then copy back.
>
> On Tue, Apr 14, 2009 at 9:01 PM, Stephen Green wrote:
> >
> > On Apr 14, 2009, at 2:41 PM, Stephen Green wrote:
> >
> >> On Apr 14, 2009, at 12:51 PM, Jeff Eastman wrote:
> >>
> >>> Hi Stephen,
> >>>
> >>> You are out on the bleeding edge with EMR.
> >>
> >> Yeah, but the view is lovely from here!
> >>
> >>> I've been able to run the kmeans example directly on a small EC2
> >>> cluster that I started up myself (using the Hadoop src/contrib/ec2
> >>> scripts). I have not yet tried EMR (just got an account yesterday),
> >>> but I see that it requires you to have your data in S3 as opposed to
> >>> HDFS.
> >>>
> >>> The job first runs the InputDriver to copy the raw test data into the
> >>> Mahout Vector external representation after deleting any pre-existing
> >>> output files. It looks to me like the two delete() snippets you show
> >>> are pretty much equivalent. If you have no pre-existing output
> >>> directory, the Mahout snippet won't attempt to delete it.
> >>
> >> I managed to figure that out :-) I'm pretty comfortable with the ideas
> >> behind MapReduce, but being confronted with my first Job is a bit more
> >> daunting than I expected.
> >>
> >>> I too am at a loss to explain what you are seeing. If you can post
> >>> more results I can try to help you read the tea leaves...
> >>
> >> I noticed that the CloudBurst job just deleted the directory without
> >> checking for existence, so I tried the same thing with Mahout:
> >>
> >> java.lang.IllegalArgumentException: Wrong FS: s3n://mahout-output,
> >> expected: hdfs://domU-12-31-38-00-6C-86.compute-1.internal:9000
> >>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> >>   at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> >>   at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> >>   at org.apache.hadoop.dfs.DistributedFileSystem.delete(DistributedFileSystem.java:210)
> >>   at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:83)
> >>   at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:46)
> >>
> >> So no joy there.
> >>
> >> Should I see if I can isolate this as an s3n problem? I suppose I could
> >> try running the Hadoop job locally with it reading and writing the data
> >> from S3 and see if it suffers from the same problem. At least then I
> >> could debug inside Hadoop.
> >>
> >> Of course, I'm doing all this on Hadoop 0.18.3, and if it is an s3n
> >> problem it might have been fixed already. That doesn't help much
> >> running on EMR, I guess.
> >>
> >> I'm also going to start a run on EMR that does away with the whole
> >> exists/delete check and see if that works.
> >
> > Following up to myself (my wife will tell you that I talk to myself!),
> > I removed a number of the exists/delete checks: in CanopyClusteringJob,
> > CanopyDriver, KMeansDriver, and ClusterDriver. This allowed the jobs to
> > progress, but they died the death a little later with the following
> > exception (and a few more; I can send the whole log if you like):
> >
> > java.lang.IllegalArgumentException: Wrong FS:
> > s3n://mahoutput/canopies/part-00000, expected:
> > hdfs://domU-12-31-39-00-A5-44.compute-1.internal:9000
> >   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:320)
> >   at org.apache.hadoop.dfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:84)
> >   at org.apache.hadoop.dfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:140)
> >   at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:408)
> >   at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:695)
> >   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1420)
> >   at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1415)
> >   at org.apache.mahout.clustering.canopy.ClusterMapper.configure(ClusterMapper.java:69)
> >   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> >   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> >   at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
> >   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> >   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:223)
> >   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
> >
> > Looking at the exception message there, I would almost swear that it
> > thinks the whole s3n path is the name of an FS that it doesn't know
> > about, but that might just be a bad message. This message repeats a few
> > times (retrying failed mappers, I guess?) and then the job fails.
> >
> > One thing that occurred to me: the Mahout examples job has the Hadoop
> > 0.19.1 core jar in it. Could I be seeing some kind of version skew
> > between the Hadoop in the job file and the one on EMR? Although it
> > worked fine with a local 0.18.3, so maybe not.
> >
> > I'm going to see if I can get the stock Mahout to run with s3n inputs
> > and outputs tomorrow, and I'll let you all know how that goes.
> >
> > Steve
> > --
> > Stephen Green          // [email protected]
> > Principal Investigator \\ http://blogs.sun.com/searchguy
> > Aura Project           // Voice: +1 781-442-0926
> > Sun Microsystems Labs  \\ Fax: +1 781-442-1692
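Both stack traces in the thread point at the same root cause: code that obtains a FileSystem handle for the cluster's default file system (HDFS) and then hands it an s3n:// path, which FileSystem.checkPath() rejects with the "Wrong FS" error. Below is a minimal sketch of the distinction against the 0.18-era Hadoop API; it is not the actual Mahout driver code, and the class name and bucket are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputCleaner {  // hypothetical class, for illustration only
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("s3n://mahout-output");

        // The failing pattern: FileSystem.get(conf) returns the *default*
        // file system (hdfs://... on an EC2/EMR cluster), and passing it an
        // s3n:// path trips FileSystem.checkPath() -- the "Wrong FS" error.
        // FileSystem fs = FileSystem.get(conf);

        // Asking the Path for its own file system instead returns the
        // NativeS3FileSystem bound to the s3n: scheme.
        FileSystem fs = output.getFileSystem(conf);
        if (fs.exists(output)) {
          fs.delete(output, true);  // recursive delete of any old output
        }
      }
    }

The same reasoning would apply to the ClusterMapper trace above: the SequenceFile.Reader there is presumably being constructed with an HDFS FileSystem alongside an s3n path.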

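For reference, the wiki page linked at the top of the thread describes Hadoop's native S3 support. A sketch of addressing a bucket directly through the s3n: scheme, assuming a 0.18-era API with credentials supplied via the standard fs.s3n.* configuration properties (the bucket, paths, and class name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3nSmokeTest {  // hypothetical class, for illustration only
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // These can also live in hadoop-site.xml; the values are placeholders.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        Path input = new Path("s3n://mahout-input/testdata");
        FileSystem fs = input.getFileSystem(conf);  // NativeS3FileSystem

        // List the input as a quick sanity check that s3n is wired up.
        FileStatus[] statuses = fs.listStatus(input);
        if (statuses != null) {
          for (FileStatus status : statuses) {
            System.out.println(status.getPath());
          }
        }
      }
    }

If something like this works locally but the Mahout drivers still fail on EMR, that would point back at the FileSystem.get(conf) pattern sketched above rather than at s3n itself.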