Hi Ryan, I actually blogged my experience as it was my first usage of EC2: http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html
My input data was not log files but a dump of 150 million records from MySQL into tab-delimited files of about 13 columns, I believe. It was a couple of months ago, but I remember thinking S3 was very slow... I ran some simple operations, like distinct values of one column based on another (species within a cell), and also did some polygon analysis, since "is this point in this polygon" does not really scale too well in PostGIS.

Incidentally, I have most of the basics of a "MapReduce-Lite", which I aim to port to use the exact Hadoop API, since I am *only* working on 10s-100s of GB of data and find that it runs really fine on my laptop and I don't need the distributed failover. My goal for that code is to serve people like me, who want to know that they can scale to terabyte processing but don't need to take the plunge to a full Hadoop deployment yet, and who will know that they can migrate the processing in the future as things grow. It runs on the normal filesystem, on a single node only (i.e. multithreaded), and performs very quickly since it is just using Java NIO ByteBuffers in parallel on the underlying filesystem - on my laptop I Map+Sort+Combine about 130,000 jobs a second (simplest of simple map operations).

For these small datasets, you might find it useful - let me know if I should spend time finishing it (or submit help?) - it is really very simple.

Cheers

Tim

On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hi Tim,
>
> Are you mostly just processing/parsing textual log files? How many
> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
> many did you configure in your JobConf? Just trying to get an idea of
> what to expect in terms of performance. I'm noticing that it takes
> about 16 minutes to transfer about 15GB of textual uncompressed data
> from S3 into HDFS after the cluster has started with 15 nodes. I was
> expecting this to take a shorter amount of time, but maybe I'm
> incorrect in my assumptions. I am also noticing that it takes about 15
> minutes to parse through the 15GB of data with a 15 node cluster.
>
> Thanks,
> Ryan
>
>
> On Tue, Sep 2, 2008 at 3:29 AM, tim robertson <[EMAIL PROTECTED]> wrote:
>> I have been processing only 100s GBs on EC2, not 1000's and using 20
>> nodes and really only in exploration and testing phase right now.
>>
>>
>> On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <[EMAIL PROTECTED]> wrote:
>>> Hi Ryan,
>>>
>>> Just a heads up, if you require more than the 20 node limit, Amazon
>>> provides a form to request a higher limit:
>>>
>>> http://www.amazon.com/gp/html-forms-controller/ec2-request
>>>
>>> Andrew
>>>
>>> On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>>>> Hello all,
>>>>
>>>> I'm curious to see how many people are using EC2 to execute their
>>>> Hadoop cluster and map/reduce programs, and how many are using
>>>> home-grown datacenters. It seems like the 20 node limit with EC2 is a
>>>> bit crippling when one wants to process many gigabytes of data. Has
>>>> anyone found this to be the case? How much data are people processing
>>>> with their 20 node limit on EC2? Curious what the thoughts are...
>>>>
>>>> Thanks,
>>>> Ryan
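
For reference, the "distinct values of one column based on another (species within a cell)" operation mentioned at the top of this message can be sketched as a map step followed by a grouping reduce step. This is only a minimal single-process illustration in plain Java, not the Hadoop job that was actually run and not the "MapReduce-Lite" code. The class name, the method names, and the column layout (cell id in column 0, species name in column 1 of each tab-delimited record) are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration: distinct species per cell over tab-delimited records.
class DistinctSpeciesPerCell {

    // Map step: turn one tab-delimited record into a (cellId, species) pair.
    // Assumption: column 0 is the cell id and column 1 is the species name.
    static Map.Entry<String, String> map(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length < 2 || cols[0].isEmpty() || cols[1].isEmpty()) {
            return null; // skip malformed records
        }
        return Map.entry(cols[0], cols[1]);
    }

    // Shuffle + reduce step: group the pairs by cell id and keep each
    // species only once per cell (the Set does the de-duplication).
    static Map<String, Set<String>> run(List<String> lines) {
        Map<String, Set<String>> speciesByCell = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, String> pair = map(line);
            if (pair == null) {
                continue;
            }
            speciesByCell
                .computeIfAbsent(pair.getKey(), k -> new TreeSet<>())
                .add(pair.getValue());
        }
        return speciesByCell;
    }

    public static void main(String[] args) {
        // Made-up sample records, for demonstration only.
        List<String> records = List.of(
            "cell-1\tPuma concolor",
            "cell-1\tPuma concolor",
            "cell-1\tLynx rufus",
            "cell-2\tPuma concolor");
        run(records).forEach((cell, species) ->
            System.out.println(cell + "\t" + species));
    }
}
```

In a real Hadoop job the same split would apply: the mapper emits the cell id as the key and the species as the value, and the reducer collapses the values for each key into a distinct set.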