Some comments on Cloudera's Distribution for Hadoop (CDH) and Elastic MapReduce (EMR). I have used both to get Hadoop jobs up and running (although my EMR use has mostly been limited to running batch Pig scripts weekly). Deciding which one to use really depends on what kind of job and data you're working with.
EMR is most useful if you're already storing your dataset on S3 and plan on running a one-off job. My understanding is that it's configured to use jets3t to stream data from S3 rather than copying it to the cluster, which is fine for a single pass over a small to medium-sized dataset, but obviously slower for multiple passes or larger datasets. The API is also useful if you have a set workflow that you plan to run on a regular basis (there's a boto sketch of this in the P.P.S. below), and I often prototype quick-and-dirty jobs on very small EMR clusters to test how things run in the wild (obviously not the most cost-effective approach, but I've found pseudo-distributed mode doesn't catch everything).

CDH gives you greater control over the initial setup and configuration of your cluster. From my understanding, it's not really an AMI. Rather, it's a set of Python scripts modified from the ec2 scripts in hadoop/contrib, with some nifty additions like being able to specify and set up EBS volumes, proxy into the cluster, and a few others. The scripts use the boto Python module (a very useful module for working with EC2) to ask EC2 for a cluster of the specified size running whatever vanilla AMI you choose. They set up the security groups, open the relevant ports, and then pass the init script to each instance once it has booted (the same user-data mechanism, which I believe is limited to 16K). The init script tells each node to download Hadoop (from Cloudera's OS-specific repos) plus any other user-specified packages and set them up. The Hadoop config XML is hardcoded into the init script (although you can pass a modified config beforehand). The master is started first, and then the slaves, so that the slaves can be told which NameNode and JobTracker to connect to (the config uses the public DNS, I believe, to make things easier to set up). A rough boto sketch of what these scripts boil down to is in the P.S. below.

You can use either 0.18.3 (CDH) or 0.20 (CDH2) as far as Hadoop versions go, although I've had mixed results with the latter. Personally, I'd still like some kind of facade or similar to further abstract things and make it easier for others to quickly set up ad-hoc clusters for quick-and-dirty jobs. I know other libraries like Crane have been released recently, but given the language of choice (Clojure), I haven't yet had a chance to really investigate.

On Mon, Jan 11, 2010 at 2:56 AM, Ted Dunning <[email protected]> wrote:

> Just use several of these files.
>
> On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[email protected]> wrote:
>
> > EMR requires S3 bucket, but S3 instance has a limit of file size (5GB),
> > so need some extra care here. Has any one encounter the file size
> > problem on S3 also? I kind of think that it's unreasonable to have a 5G
> > size limit when we want to use the system to deal with large data set.
>
> --
> Ted Dunning, CTO
> DeepDyve

--
Zaki Rahaman
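
P.S. For anyone curious, here is roughly what those CDH launch scripts boil down to in boto. This is only a minimal sketch of the pattern (create a security group, open ports, boot the master, then boot the slaves with a user-data init script), not the actual Cloudera code; the AMI ID, key pair, ports, and script filename are placeholders I've made up for illustration.

import time
import boto

# Placeholders -- substitute your own AMI, key pair, and init script.
AMI_ID = 'ami-xxxxxxxx'        # any vanilla AMI
KEY_NAME = 'my-keypair'
NUM_SLAVES = 4
INIT_SCRIPT = open('hadoop-init-remote.sh').read()   # passed as user-data (16K cap)

conn = boto.connect_ec2()   # credentials come from the environment / boto config

# Security group with the ports a small cluster needs (illustrative list).
group = conn.create_security_group('hadoop-cluster', 'ad-hoc Hadoop cluster')
group.authorize('tcp', 22, 22, '0.0.0.0/0')          # ssh
group.authorize('tcp', 50030, 50030, '0.0.0.0/0')    # JobTracker web UI
group.authorize('tcp', 50070, 50070, '0.0.0.0/0')    # NameNode web UI
group.authorize(src_group=group)                     # open intra-cluster traffic

# Master first, so its public DNS can be handed to the slaves.
reservation = conn.run_instances(AMI_ID, key_name=KEY_NAME,
                                 security_groups=['hadoop-cluster'],
                                 instance_type='m1.large',
                                 user_data=INIT_SCRIPT)
master = reservation.instances[0]
while master.update() != 'running':
    time.sleep(10)

# Slaves get the master's address appended to their init script so they
# know which NameNode/JobTracker to report to.
slave_data = INIT_SCRIPT + '\nMASTER_HOST=%s\n' % master.public_dns_name
conn.run_instances(AMI_ID, min_count=NUM_SLAVES, max_count=NUM_SLAVES,
                   key_name=KEY_NAME, security_groups=['hadoop-cluster'],
                   instance_type='m1.large', user_data=slave_data)

The real scripts obviously do more (EBS volumes, the proxy, waiting for daemons, tearing everything down afterwards), but that's the basic shape.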

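P.P.S. On the "set workflow you run on a regular basis" point: newer boto releases also ship EMR bindings, so a weekly Pig run can be kicked off from a cron'd Python script rather than the console. Something like the sketch below; I'm writing the class and argument names from memory, so treat them (and the bucket/script paths) as assumptions to check against whatever boto version you have.

# Sketch only: needs a boto new enough to ship the boto.emr module, and the
# class/argument names here are from memory, so double-check them.
from boto.emr.connection import EmrConnection
from boto.emr.step import InstallPigStep, PigStep

conn = EmrConnection()   # credentials from the environment / boto config

# Bucket and script paths are placeholders.
steps = [InstallPigStep(),
         PigStep('weekly-report',
                 pig_file='s3n://my-bucket/scripts/weekly_report.pig')]

jobflow_id = conn.run_jobflow(name='weekly pig run',
                              log_uri='s3n://my-bucket/logs/',
                              steps=steps,
                              num_instances=3,
                              master_instance_type='m1.small',
                              slave_instance_type='m1.small')
print(jobflow_id)

The Pig script itself can read and write s3n:// paths directly, which is the streaming-from-S3 behaviour I mentioned above.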