I used Cloudera's distribution (CDH) with Mahout to test the Decision Forest implementation.

--- On Mon, 1/11/10, Grant Ingersoll <[email protected]> wrote:

> From: Grant Ingersoll <[email protected]>
> Subject: Re: Re: Good starting instance for AMI
> To: [email protected]
> Date: Monday, January 11, 2010, 8:51 PM
> One quick question for all who
> responded:
> How many have tried Mahout with the setup they
> recommended?
> 
> -Grant
> 
> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> 
> > Some comments on Cloudera's Hadoop (CDH) and Elastic
> MapReduce (EMR).
> > 
> > I have used both to get Hadoop jobs up and running
> (although my EMR use has
> > mostly been limited to running batch Pig scripts
> weekly). Deciding on which
> > one to use really depends on what kind of job/data
> you're working with.
> > 
> > EMR is most useful if you're already storing the
> dataset you're using on S3
> > and plan on running a one-off job. My understanding is
> that it's configured
> to use jets3t to stream data from S3 rather than
> copying it to the cluster,
> > which is fine for a single pass over a small to medium
> sized dataset, but
> > obviously slower for multiple passes or larger
> datasets. The API is also
> > useful if you have a set workflow that you plan to run
> on a regular basis,
> > and I often prototype quick and dirty jobs on very
> small EMR clusters to
> > test how some things run in the wild (obviously not
> the most cost effective
> > solution, but I've found pseudo-distributed mode
> doesn't catch everything).
> > 
> > CDH gives you greater control over the initial setup
> and configuration of
> > your cluster. From my understanding, it's not really
> an AMI. Rather, it's a
> > set of Python scripts that have been modified from the
> ec2 scripts from
> > hadoop/contrib with some nifty additions like being
> able to specify and set
> > up EBS volumes, proxy on the cluster, and some others.
> The scripts use the
> > boto Python module (a very useful Python module for
> working with EC2) to
> > make a request to EC2 to set up a cluster of the
> specified size with whatever
> > vanilla AMI is specified. It sets up the security
> groups and opens up
> > the relevant ports and it then passes the init script
> to each of the
> > instances once they've booted (the same user-data
> mechanism, which is limited to
> > 16 KB, I believe). The init script tells each node to
> download hadoop (from
> > Cloudera's OS-specific repos) and any other
> user-specified packages and set
> > them up. The Hadoop config XML is hardcoded into the
> init script (although
> > you can pass a modified config beforehand). The master
> is started first, and
> > then the slaves are started so that the slaves can be
> given info about what
> > NN and JT to connect to (the config uses the public
> DNS I believe to make
> > things easier to set up). You can use either 0.18.3
> (CDH) or 0.20 (CDH2)
> > when it comes to Hadoop versions, although I've had
> mixed results with the
> > latter.
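The launch flow described above (build a user-data init script, respect the 16 KB user-data cap, then ask boto to start instances) can be sketched roughly as below. This is a minimal illustration, not the actual CDH scripts: the AMI id, repo URL, package names, and security-group details are placeholders, and only the boto calls at the end touch EC2.

```python
# Hedged sketch of the CDH-style launch flow described above.
# The repo URL, AMI id, and ports are illustrative placeholders.

def build_init_script(extra_packages=()):
    """Init script passed to each node as EC2 user-data; runs on boot."""
    lines = [
        "#!/bin/sh",
        # Placeholder URL -- the real scripts point each node at
        # Cloudera's OS-specific package repositories.
        "wget -qO- http://archive.cloudera.example/install.sh | sh",
    ]
    for pkg in extra_packages:
        lines.append("apt-get install -y %s" % pkg)
    return "\n".join(lines) + "\n"

USER_DATA_LIMIT = 16 * 1024  # EC2 user-data is capped at 16 KB

script = build_init_script(["python-dev"])
assert len(script.encode("utf-8")) <= USER_DATA_LIMIT

# With AWS credentials configured, the launch itself would then use
# boto along these lines (security group first, then the instances):
#
#   from boto.ec2.connection import EC2Connection
#   conn = EC2Connection()
#   sg = conn.create_security_group("hadoop-cluster", "ad-hoc Hadoop")
#   sg.authorize("tcp", 50030, 50030, "0.0.0.0/0")  # JobTracker UI
#   conn.run_instances("ami-12345678", min_count=5, max_count=5,
#                      security_groups=["hadoop-cluster"],
#                      user_data=script)
```

The master would be launched first so its public DNS can be baked into the slaves' user-data, matching the NN/JT ordering described above.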
> > 
> > Personally, I'd still like some kind of facade or
> something similar to
> > further abstract things and make it easier for others
> to quickly set up
> > ad-hoc clusters for 'quick n dirty' jobs. I know other
> libraries like Crane
> > have been released recently, but given the language of
> choice (Clojure), I
> > haven't yet had a chance to really investigate.
> > 
> > On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]>
> wrote:
> > 
> >> Just use several of these files.
> >> 
> >> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin
> <[email protected]
> >>> wrote:
> >> 
> >>> EMR requires an S3 bucket, but S3 objects have a
> >>> file size limit (5 GB), so some extra care is needed
> >>> here. Has anyone encountered the file size problem
> >>> on S3 as well? I kind of think it's unreasonable to
> >>> have a 5 GB limit when we want to use the system to
> >>> deal with large data sets.
> >>> 
> >> 
> >> 
> >> 
> >> --
> >> Ted Dunning, CTO
> >> DeepDyve
> >> 
> > 
> > 
> > 
> > -- 
> > Zaki Rahaman
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem using Solr/Lucene: 
> http://www.lucidimagination.com/search
> 
> 


