Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).

I have used both to get Hadoop jobs up and running (although my EMR use has
mostly been limited to running weekly batch Pig scripts). Which one to use
really depends on the kind of job and data you're working with.

EMR is most useful if you're already storing your dataset on S3 and plan on
running a one-off job. My understanding is that it's configured to use JetS3t
to stream data from S3 rather than copying it to the cluster, which is fine
for a single pass over a small-to-medium-sized dataset, but obviously slower
for multiple passes or larger datasets. The API is also useful if you have a
set workflow that you plan to run on a regular basis, and I often prototype
quick-and-dirty jobs on very small EMR clusters to test how things run in the
wild (obviously not the most cost-effective solution, but I've found that
pseudo-distributed mode doesn't catch everything).
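
For instance, kicking off a one-off Pig job flow against data already sitting
in S3 can be scripted through the API with boto. A rough sketch (assuming a
boto version that ships the boto.emr module; the bucket names, script path,
and Pig parameters below are placeholders I've made up):

import boto
from boto.emr.step import PigStep

conn = boto.connect_emr()  # credentials come from the usual boto config/env

# A single Pig step that reads its input straight from S3
step = PigStep(
    name='weekly-report',
    pig_file='s3://my-bucket/scripts/report.pig',
    pig_args=['-p', 'INPUT=s3://my-bucket/logs/',
              '-p', 'OUTPUT=s3://my-bucket/output/report/'],
)

# Small throwaway cluster: one master plus two slaves
jobflow_id = conn.run_jobflow(
    name='quick-and-dirty-prototype',
    log_uri='s3://my-bucket/emr-logs/',
    steps=[step],
    num_instances=3,
    master_instance_type='m1.small',
    slave_instance_type='m1.small',
)
print(jobflow_id)

With the default keep_alive=False the job flow shuts itself down once the step
finishes, which is part of why this works well for scheduled or throwaway runs.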

CDH gives you greater control over the initial setup and configuration of
your cluster. From my understanding, it's not really an AMI. Rather, it's a
set of Python scripts modified from the EC2 scripts in hadoop/contrib, with
some nifty additions like being able to specify and set up EBS volumes, proxy
into the cluster, and a few others. The scripts use the boto Python module (a
very useful module for working with EC2) to ask EC2 for a cluster of the
specified size built from whatever vanilla AMI you specify. They set up the
security groups, open the relevant ports, and then pass the init script to
each instance once it has booted (the same user-data file mechanism, which is
limited to 16K I believe). The init script tells each node to download Hadoop
(from Cloudera's OS-specific repos) plus any other user-specified packages
and set them up. The Hadoop config XML is hardcoded into the init script
(although you can pass a modified config beforehand). The master is started
first, and then the slaves, so that the slaves can be told which NameNode and
JobTracker to connect to (the config uses the public DNS, I believe, to make
things easier to set up). You can use either Hadoop 0.18.3 (CDH) or 0.20
(CDH2), although I've had mixed results with the latter.
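
To give a rough idea of what those scripts do under the hood, here's a
stripped-down boto sketch (my own approximation, not the actual CDH code; the
AMI ID, group name, ports, and init script name are placeholders):

import boto

conn = boto.connect_ec2()

# Security group with the ports the cluster needs open
group = conn.create_security_group('hadoop-cluster', 'Ad-hoc Hadoop cluster')
group.authorize('tcp', 22, 22, '0.0.0.0/0')        # SSH
group.authorize('tcp', 50030, 50030, '0.0.0.0/0')  # JobTracker web UI
group.authorize('tcp', 50070, 50070, '0.0.0.0/0')  # NameNode web UI

# Init script passed as user-data (the ~16K-limited mechanism mentioned
# above); it installs Hadoop from the repos and writes the config on boot
user_data = open('hadoop-init.sh').read()

# Master first, so its public DNS is known before the slaves come up
master = conn.run_instances('ami-12345678', instance_type='m1.large',
                            security_groups=['hadoop-cluster'],
                            user_data=user_data).instances[0]

# Then the slaves; the real scripts substitute the master's public DNS into
# the slaves' user-data so they know which NN/JT to contact
slaves = conn.run_instances('ami-12345678', min_count=4, max_count=4,
                            instance_type='m1.large',
                            security_groups=['hadoop-cluster'],
                            user_data=user_data).instances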

Personally, I'd still like some kind of facade or similar abstraction to make
it even easier for others to quickly set up ad-hoc clusters for
quick-and-dirty jobs. I know other libraries like Crane have been released
recently, but given the language of choice (Clojure), I haven't yet had a
chance to really investigate.

On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:

> Just use several of these files.
>
> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]
> >wrote:
>
> > EMR requires an S3 bucket, but S3 has a per-file size limit (5GB), so it
> > needs some extra care here. Has anyone else encountered the file size
> > problem on S3? I kind of think it's unreasonable to have a 5GB limit when
> > we want to use the system to deal with large datasets.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Zaki Rahaman
