I used Cloudera's distribution with Mahout to test the Decision Forest implementation. --- On Mon, 11.1.10, Grant Ingersoll <[email protected]> wrote:
> From: Grant Ingersoll <[email protected]>
> Subject: Re: Re: Good starting instance for AMI
> To: [email protected]
> Date: Monday, January 11, 2010, 8:51 PM
>
> One quick question for all who responded: how many have tried Mahout with the setup they recommended?
>
> -Grant
>
> On Jan 11, 2010, at 10:43 AM, zaki rahaman wrote:
>
> > Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).
> >
> > I have used both to get Hadoop jobs up and running (although my EMR use has mostly been limited to running batch Pig scripts weekly). Deciding which one to use really depends on what kind of job/data you're working with.
> >
> > EMR is most useful if you're already storing the dataset you're using on S3 and plan on running a one-off job. My understanding is that it's configured to use jets3t to stream data from S3 rather than copying it to the cluster, which is fine for a single pass over a small- to medium-sized dataset, but obviously slower for multiple passes or larger datasets. The API is also useful if you have a set workflow that you plan to run on a regular basis, and I often prototype quick-and-dirty jobs on very small EMR clusters to test how some things run in the wild (obviously not the most cost-effective solution, but I've found pseudo-distributed mode doesn't catch everything).
> >
> > CDH gives you greater control over the initial setup and configuration of your cluster. From my understanding, it's not really an AMI. Rather, it's a set of Python scripts modified from the ec2 scripts in hadoop/contrib, with some nifty additions like being able to specify and set up EBS volumes, a proxy on the cluster, and a few others. The scripts use the boto Python module (a very useful module for working with EC2) to ask EC2 to set up a cluster of the specified size from whatever vanilla AMI you specify. They create the security groups, open the relevant ports, and then pass the init script to each instance once it has booted (the same user-data mechanism, which is limited to 16K, I believe). The init script tells each node to download Hadoop (from Cloudera's OS-specific repos) plus any other user-specified packages and set them up. The Hadoop config XML is hardcoded into the init script (although you can pass a modified config beforehand). The master is started first, and then the slaves, so that the slaves can be told which NN and JT to connect to (the config uses the public DNS, I believe, to make things easier to set up). As for Hadoop versions, you can use either 0.18.3 (CDH) or 0.20 (CDH2), although I've had mixed results with the latter.
> >
> > Personally, I'd still like some kind of facade or something similar to further abstract things and make it easier for others to quickly set up ad-hoc clusters for quick-and-dirty jobs. I know other libraries like Crane have been released recently, but given the language of choice (Clojure), I haven't yet had a chance to really investigate.
> >
> > On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:
> >
> >> Just use several of these files.
> >>
> >> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]> wrote:
> >>
> >>> EMR requires an S3 bucket, but S3 has a limit on file size (5 GB per object), so this needs some extra care.
> >>> Has anyone encountered the file size problem on S3 as well? I kind of think it's unreasonable to have a 5 GB limit when we want to use the system to deal with large datasets.
> >>
> >> --
> >> Ted Dunning, CTO
> >> DeepDyve
> >
> > --
> > Zaki Rahaman
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
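
For what it's worth, the launch flow zaki describes (a boto request against a vanilla AMI, security groups, an init script passed as user-data, master brought up before the slaves) looks roughly like the sketch below. This is not the actual CDH scripts, just a minimal illustration; the AMI id, key pair name, security group name and instance type are placeholders.

    # Minimal sketch of a CDH-style launch flow with boto; not the real scripts.
    # AMI id, key pair, group name and instance type are placeholders.
    import time
    import boto

    INIT_SCRIPT = """#!/bin/bash
    # would fetch Hadoop from Cloudera's repo and write the Hadoop config XML;
    # the whole script has to stay under the ~16K user-data limit
    """

    conn = boto.connect_ec2()   # picks up AWS credentials from the environment

    # security group with the ports the cluster needs (only SSH shown here)
    group = conn.create_security_group('hadoop-adhoc', 'ad-hoc Hadoop cluster')
    group.authorize(ip_protocol='tcp', from_port=22, to_port=22,
                    cidr_ip='0.0.0.0/0')

    # start the master first so its public DNS can be handed to the slaves
    reservation = conn.run_instances('ami-xxxxxxxx', key_name='my-keypair',
                                     instance_type='m1.large',
                                     security_groups=['hadoop-adhoc'],
                                     user_data=INIT_SCRIPT)
    master = reservation.instances[0]
    while master.update() != 'running':
        time.sleep(10)

    # slaves get the master's public DNS appended to their user-data so the
    # init script knows which NN/JT to point them at
    slave_data = INIT_SCRIPT + '\nMASTER_HOST=%s\n' % master.public_dns_name
    conn.run_instances('ami-xxxxxxxx', min_count=4, max_count=4,
                       key_name='my-keypair', instance_type='m1.large',
                       security_groups=['hadoop-adhoc'], user_data=slave_data)

The real scripts obviously do a lot more (EBS volumes, the proxy, the full config XML), but the request/boot/user-data sequence above is the core of it.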

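And on Liang's 5 GB question: Ted's "just use several of these files" boils down to splitting the dataset into multiple S3 objects before uploading, which boto handles as well. A rough sketch, with the bucket name and paths as placeholders:

    # Split a large input into parts that stay under S3's 5 GB per-object limit
    # and upload each part with boto. Bucket name and paths are placeholders.
    import os
    import boto
    from boto.s3.key import Key

    PART_SIZE = 4 * 1024 ** 3          # 4 GB per part, safely under the limit
    BLOCK = 64 * 1024 * 1024           # copy in 64 MB blocks

    def split_file(path, part_size=PART_SIZE):
        """Write path.part0000, path.part0001, ... and yield each part's name."""
        index = 0
        with open(path, 'rb') as src:
            while True:
                part_name = '%s.part%04d' % (path, index)
                written = 0
                with open(part_name, 'wb') as dst:
                    while written < part_size:
                        chunk = src.read(min(BLOCK, part_size - written))
                        if not chunk:
                            break
                        dst.write(chunk)
                        written += len(chunk)
                if written == 0:       # nothing left in the source file
                    os.remove(part_name)
                    break
                yield part_name
                index += 1

    conn = boto.connect_s3()                       # credentials from the environment
    bucket = conn.get_bucket('my-dataset-bucket')  # placeholder bucket

    for part in split_file('/data/ratings.csv'):
        key = Key(bucket, 'input/' + os.path.basename(part))
        key.set_contents_from_filename(part)

Since the job can then be pointed at the whole input/ prefix, the extra objects don't complicate anything downstream. For line-oriented data you would want to split on record boundaries (the way split -l does) rather than at raw byte offsets as in this sketch.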