I'm using Cloudera's distribution with a 5-node cluster (+ 1 master node) running Hadoop 0.20+. Hadoop comes pre-installed and configured; all I have to do is wget the Mahout job files and the data from S3, then launch my job.
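For what it's worth, that fetch-and-launch step is small enough to script. A minimal Python sketch, assuming boto for the S3 pulls (wget works just as well for public objects); the bucket, key, and driver-class names are hypothetical placeholders:

    import subprocess
    from boto.s3.connection import S3Connection

    # Credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
    conn = S3Connection()
    bucket = conn.get_bucket('my-mahout-bucket')  # hypothetical bucket name

    # Pull the job jar and the input data down to the master node.
    bucket.get_key('jobs/mahout-examples-job.jar') \
          .get_contents_to_filename('mahout-examples-job.jar')
    bucket.get_key('data/input.data').get_contents_to_filename('input.data')

    # Copy the data onto HDFS and launch the job through the hadoop CLI.
    subprocess.check_call(['hadoop', 'fs', '-put', 'input.data', 'input/'])
    subprocess.check_call(['hadoop', 'jar', 'mahout-examples-job.jar',
                           'com.example.MyDriver',  # hypothetical driver class
                           'input/', 'output/'])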
--- On Tue 12.1.10, deneche abdelhakim <[email protected]> wrote:

> From: deneche abdelhakim <[email protected]>
> Subject: Re: Re : Good starting instance for AMI
> To: [email protected]
> Date: Tuesday, January 12, 2010, 3:44 AM
>
> I used Cloudera's with Mahout to test the Decision Forest
> implementation.
>
> --- On Mon 11.1.10, Grant Ingersoll <[email protected]> wrote:
>
> > From: Grant Ingersoll <[email protected]>
> > Subject: Re: Re : Good starting instance for AMI
> > To: [email protected]
> > Date: Monday, January 11, 2010, 8:51 PM
> >
> > One quick question for all who responded:
> > How many have tried Mahout with the setup they recommended?
> >
> > -Grant
> >
> > On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
> >
> > > Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
> > >
> > > I have used both to get hadoop jobs up and running (although my EMR
> > > use has mostly been limited to running batch Pig scripts weekly).
> > > Deciding on which one to use really depends on what kind of job/data
> > > you're working with.
> > >
> > > EMR is most useful if you're already storing the dataset you're using
> > > on S3 and plan on running a one-off job. My understanding is that it's
> > > configured to use jets3t to stream data from S3 rather than copying it
> > > to the cluster, which is fine for a single pass over a small to
> > > medium-sized dataset, but obviously slower for multiple passes or
> > > larger datasets. The API is also useful if you have a set workflow
> > > that you plan to run on a regular basis, and I often prototype quick
> > > and dirty jobs on very small EMR clusters to test how some things run
> > > in the wild (obviously not the most cost-effective solution, but I've
> > > found pseudo-distributed mode doesn't catch everything).
> > >
> > > CDH gives you greater control over the initial setup and
> > > configuration of your cluster. From my understanding, it's not really
> > > an AMI. Rather, it's a set of Python scripts modified from the ec2
> > > scripts in hadoop/contrib, with some nifty additions like being able
> > > to specify and set up EBS volumes, proxy on the cluster, and some
> > > others. The scripts use the boto Python module (a very useful Python
> > > module for working with EC2) to ask EC2 to set up a cluster of the
> > > specified size from whatever vanilla AMI is specified. They set up
> > > the security groups, open the relevant ports, and then pass the init
> > > script to each of the instances once they've booted (the same
> > > user-data file setup, which is limited to 16K, I believe). The init
> > > script tells each node to download hadoop (from Cloudera's
> > > OS-specific repos) and any other user-specified packages and set them
> > > up. The hadoop config xml is hardcoded into the init script (although
> > > you can pass a modified config beforehand). The master is started
> > > first, and then the slaves, so that the slaves can be given info
> > > about which NN and JT to connect to (the config uses the public DNS,
> > > I believe, to make things easier to set up). You can use either
> > > 0.18.3 (CDH) or 0.20 (CDH2) when it comes to Hadoop versions,
> > > although I've had mixed results with the latter.
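(As a rough illustration of the sequence Zaki describes above: open the ports, boot the master, then boot the slaves pointed at it, all with the init script riding along as user-data. This is a hedged boto sketch of the idea, not the actual CDH scripts; the AMI id, init-script file name, and %MASTER% placeholder are invented for the example.)

    import time
    from boto.ec2.connection import EC2Connection

    conn = EC2Connection()  # credentials from the environment

    # Open the ports the cluster needs before booting anything.
    group = conn.create_security_group('hadoop-cluster', 'ad-hoc Hadoop cluster')
    group.authorize('tcp', 22, 22, '0.0.0.0/0')        # ssh
    group.authorize('tcp', 50030, 50030, '0.0.0.0/0')  # JobTracker web UI
    group.authorize('tcp', 50070, 50070, '0.0.0.0/0')  # NameNode web UI

    # The init script travels as user-data, so it must stay under the ~16K limit.
    init_script = open('hadoop-init.sh').read()

    # Master first; wait until it's running so its public DNS can be
    # handed to the slaves' config.
    master = conn.run_instances('ami-12345678', instance_type='m1.large',
                                security_groups=['hadoop-cluster'],
                                user_data=init_script).instances[0]
    while master.update() != 'running':
        time.sleep(10)

    # Five slaves, each told which NN/JT to connect to via the master's DNS.
    slaves = conn.run_instances('ami-12345678', min_count=5, max_count=5,
                                instance_type='m1.large',
                                security_groups=['hadoop-cluster'],
                                user_data=init_script.replace(
                                    '%MASTER%', master.public_dns_name))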
> > >
> > > Personally, I'd still like some kind of facade or something similar
> > > to further abstract things and make it easier for others to quickly
> > > set up ad-hoc clusters for 'quick n dirty' jobs. I know other
> > > libraries like Crane have been released recently, but given the
> > > language of choice (Clojure), I haven't yet had a chance to really
> > > investigate.
> > >
> > > On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
> > >
> > >> Just use several of these files.
> > >>
> > >> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin
> > >> <[email protected]> wrote:
> > >>
> > >>> EMR requires an S3 bucket, but S3 objects have a size limit (5GB),
> > >>> so some extra care is needed here. Has anyone else run into the
> > >>> file size problem on S3? I kind of think it's unreasonable to have
> > >>> a 5GB limit when we want to use the system to deal with large data
> > >>> sets.
> > >>>
> > >>
> > >> --
> > >> Ted Dunning, CTO
> > >> DeepDyve
> > >>
> > >
> > > --
> > > Zaki Rahaman
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem using Solr/Lucene:
> > http://www.lucidimagination.com/search
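Regarding the 5GB limit Liang raised and Ted's "several files" suggestion: a minimal sketch of splitting a large input into sub-5GB parts before upload, since Hadoop happily takes the resulting directory of part files as its input. Names here are hypothetical, and a real script would split on record boundaries (newlines) rather than raw bytes:

    import os
    from boto.s3.connection import S3Connection

    PART_SIZE = 4 * 1024 ** 3  # 4 GB per part, safely under the 5 GB cap
    BLOCK = 64 * 1024 * 1024   # copy in 64 MB blocks

    def split_and_upload(path, bucket_name, prefix):
        bucket = S3Connection().get_bucket(bucket_name)
        src = open(path, 'rb')
        part, written = 0, 0
        out = open('part-%05d' % part, 'wb')
        while True:
            block = src.read(min(BLOCK, PART_SIZE - written))
            if not block:
                break
            out.write(block)
            written += len(block)
            if written == PART_SIZE:  # part is full: ship it, start the next
                out.close()
                bucket.new_key('%s/part-%05d' % (prefix, part)) \
                      .set_contents_from_filename(out.name)
                os.remove(out.name)
                part += 1
                written = 0
                out = open('part-%05d' % part, 'wb')
        out.close()
        if written:  # upload the final, partial part
            bucket.new_key('%s/part-%05d' % (prefix, part)) \
                  .set_contents_from_filename(out.name)
        os.remove(out.name)
        src.close()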
