I'm using Cloudera's distribution with a 5-node cluster (+ 1 master node) running Hadoop 0.20+. Hadoop comes pre-installed and configured; all I have to do is wget the Mahout job files and the data from S3, then launch my job.
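For what it's worth, that fetch-and-launch step is small enough to script. A minimal Python sketch, assuming boto for the S3 pulls (wget works just as well for public objects); the bucket, key, and driver-class names are hypothetical placeholders:

    import subprocess
    from boto.s3.connection import S3Connection

    # Credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
    conn = S3Connection()
    bucket = conn.get_bucket('my-mahout-bucket')  # hypothetical bucket name

    # Pull the job jar and the input data down to the master node.
    bucket.get_key('jobs/mahout-examples-job.jar') \
          .get_contents_to_filename('mahout-examples-job.jar')
    bucket.get_key('data/input.data').get_contents_to_filename('input.data')

    # Copy the data onto HDFS and launch the job through the hadoop CLI.
    subprocess.check_call(['hadoop', 'fs', '-put', 'input.data', 'input/'])
    subprocess.check_call(['hadoop', 'jar', 'mahout-examples-job.jar',
                           'com.example.MyDriver',  # hypothetical driver class
                           'input/', 'output/'])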
--- On Tue 12.1.10, deneche abdelhakim <[email protected]> wrote:

> From: deneche abdelhakim <[email protected]>
> Subject: Re: Re : Good starting instance for AMI
> To: [email protected]
> Date: Tuesday, January 12, 2010, 3:44 AM
>
> I used Cloudera's with Mahout to test the Decision Forest
> implementation.
>
> --- On Mon 11.1.10, Grant Ingersoll <[email protected]> wrote:
>
> > From: Grant Ingersoll <[email protected]>
> > Subject: Re: Re : Good starting instance for AMI
> > To: [email protected]
> > Date: Monday, January 11, 2010, 8:51 PM
> >
> > One quick question for all who responded:
> > How many have tried Mahout with the setup they recommended?
> >
> > -Grant
> >
> > On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> >
> > > Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
> > >
> > > I have used both to get hadoop jobs up and running (although my EMR
> > > use has mostly been limited to running batch Pig scripts weekly).
> > > Deciding on which one to use really depends on what kind of job/data
> > > you're working with.
> > >
> > > EMR is most useful if you're already storing the dataset you're using
> > > on S3 and plan on running a one-off job. My understanding is that it's
> > > configured to use jets3t to stream data from S3 rather than copying it
> > > to the cluster, which is fine for a single pass over a small to
> > > medium-sized dataset, but obviously slower for multiple passes or
> > > larger datasets. The API is also useful if you have a set workflow
> > > that you plan to run on a regular basis, and I often prototype quick
> > > and dirty jobs on very small EMR clusters to test how some things run
> > > in the wild (obviously not the most cost-effective solution, but I've
> > > found pseudo-distributed mode doesn't catch everything).
> > >
> > > CDH gives you greater control over the initial setup and
> > > configuration of your cluster. From my understanding, it's not really
> > > an AMI. Rather, it's a set of Python scripts modified from the ec2
> > > scripts in hadoop/contrib, with some nifty additions like being able
> > > to specify and set up EBS volumes, proxy on the cluster, and some
> > > others. The scripts use the boto Python module (a very useful Python
> > > module for working with EC2) to ask EC2 to set up a cluster of the
> > > specified size from whatever vanilla AMI is specified. They set up
> > > the security groups, open the relevant ports, and then pass the init
> > > script to each of the instances once they've booted (the same
> > > user-data file setup, which is limited to 16K, I believe). The init
> > > script tells each node to download hadoop (from Cloudera's
> > > OS-specific repos) and any other user-specified packages and set them
> > > up. The hadoop config xml is hardcoded into the init script (although
> > > you can pass a modified config beforehand). The master is started
> > > first, and then the slaves, so that the slaves can be given info
> > > about which NN and JT to connect to (the config uses the public DNS,
> > > I believe, to make things easier to set up). You can use either
> > > 0.18.3 (CDH) or 0.20 (CDH2) when it comes to Hadoop versions,
> > > although I've had mixed results with the latter.
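(As a rough illustration of the sequence Zaki describes above: open the ports, boot the master, then boot the slaves pointed at it, all with the init script riding along as user-data. This is a hedged boto sketch of the idea, not the actual CDH scripts; the AMI id, init-script file name, and %MASTER% placeholder are invented for the example.)

    import time
    from boto.ec2.connection import EC2Connection

    conn = EC2Connection()  # credentials from the environment

    # Open the ports the cluster needs before booting anything.
    group = conn.create_security_group('hadoop-cluster', 'ad-hoc Hadoop cluster')
    group.authorize('tcp', 22, 22, '0.0.0.0/0')        # ssh
    group.authorize('tcp', 50030, 50030, '0.0.0.0/0')  # JobTracker web UI
    group.authorize('tcp', 50070, 50070, '0.0.0.0/0')  # NameNode web UI

    # The init script travels as user-data, so it must stay under the ~16K limit.
    init_script = open('hadoop-init.sh').read()

    # Master first; wait until it's running so its public DNS can be
    # handed to the slaves' config.
    master = conn.run_instances('ami-12345678', instance_type='m1.large',
                                security_groups=['hadoop-cluster'],
                                user_data=init_script).instances[0]
    while master.update() != 'running':
        time.sleep(10)

    # Five slaves, each told which NN/JT to connect to via the master's DNS.
    slaves = conn.run_instances('ami-12345678', min_count=5, max_count=5,
                                instance_type='m1.large',
                                security_groups=['hadoop-cluster'],
                                user_data=init_script.replace(
                                    '%MASTER%', master.public_dns_name))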
> > >
> > > Personally, I'd still like some kind of facade or something similar
> > > to further abstract things and make it easier for others to quickly
> > > set up ad-hoc clusters for 'quick n dirty' jobs. I know other
> > > libraries like Crane have been released recently, but given the
> > > language of choice (Clojure), I haven't yet had a chance to really
> > > investigate.
> > >
> > > On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
> > >
> > >> Just use several of these files.
> > >>
> > >> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin
> > >> <[email protected]> wrote:
> > >>
> > >>> EMR requires an S3 bucket, but S3 objects have a size limit (5GB),
> > >>> so some extra care is needed here. Has anyone else run into the
> > >>> file size problem on S3? I kind of think it's unreasonable to have
> > >>> a 5GB limit when we want to use the system to deal with large data
> > >>> sets.
> > >>>
> > >>
> > >> --
> > >> Ted Dunning, CTO
> > >> DeepDyve
> > >>
> > >
> > > --
> > > Zaki Rahaman
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem using Solr/Lucene:
> > http://www.lucidimagination.com/search
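Regarding the 5GB limit Liang raised and Ted's "several files" suggestion: a minimal sketch of splitting a large input into sub-5GB parts before upload, since Hadoop happily takes the resulting directory of part files as its input. Names here are hypothetical, and a real script would split on record boundaries (newlines) rather than raw bytes:

    import os
    from boto.s3.connection import S3Connection

    PART_SIZE = 4 * 1024 ** 3  # 4 GB per part, safely under the 5 GB cap
    BLOCK = 64 * 1024 * 1024   # copy in 64 MB blocks

    def split_and_upload(path, bucket_name, prefix):
        bucket = S3Connection().get_bucket(bucket_name)
        src = open(path, 'rb')
        part, written = 0, 0
        out = open('part-%05d' % part, 'wb')
        while True:
            block = src.read(min(BLOCK, PART_SIZE - written))
            if not block:
                break
            out.write(block)
            written += len(block)
            if written == PART_SIZE:  # part is full: ship it, start the next
                out.close()
                bucket.new_key('%s/part-%05d' % (prefix, part)) \
                      .set_contents_from_filename(out.name)
                os.remove(out.name)
                part += 1
                written = 0
                out = open('part-%05d' % part, 'wb')
        out.close()
        if written:  # upload the final, partial part
            bucket.new_key('%s/part-%05d' % (prefix, part)) \
                  .set_contents_from_filename(out.name)
        os.remove(out.name)
        src.close()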
