There's a case study with some numbers in it from a presentation I gave on Hadoop and AWS in London last month, which you may find interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

tim robertson <[EMAIL PROTECTED]> wrote:
> For these small
> datasets, you might find it useful - let me know if I should spend
> time finishing it (Or submit help?) - it is really very simple.

This sounds very useful. Please consider creating a Jira and
submitting the code (even if it's not "finished" folks might like to
see it). Thanks.

Tom

>
> Cheers
>
> Tim
>
> On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>> Hi Tim,
>>
>> Are you mostly just processing/parsing textual log files? How many
>> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
>> many did you configure in your JobConf? Just trying to get an idea of
>> what to expect in terms of performance. I'm noticing that it takes
>> about 16 minutes to transfer about 15GB of textual uncompressed data
>> from S3 into HDFS after the cluster has started with 15 nodes. I was
>> expecting this to take a shorter amount of time, but maybe I'm
>> incorrect in my assumptions. I am also noticing that it takes about 15
>> minutes to parse through the 15GB of data with a 15 node cluster.
>>
>> Thanks,
>> Ryan
>>
>> On Tue, Sep 2, 2008 at 3:29 AM, tim robertson <[EMAIL PROTECTED]> wrote:
>>> I have been processing only 100s GBs on EC2, not 1000's and using 20
>>> nodes and really only in exploration and testing phase right now.
>>>
>>> On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <[EMAIL PROTECTED]> wrote:
>>>> Hi Ryan,
>>>>
>>>> Just a heads up, if you require more than the 20 node limit, Amazon
>>>> provides a form to request a higher limit:
>>>>
>>>> http://www.amazon.com/gp/html-forms-controller/ec2-request
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>>>>> Hello all,
>>>>>
>>>>> I'm curious to see how many people are using EC2 to execute their
>>>>> Hadoop cluster and map/reduce programs, and how many are using
>>>>> home-grown datacenters. It seems like the 20 node limit with EC2 is a
>>>>> bit crippling when one wants to process many gigabytes of data. Has
>>>>> anyone found this to be the case? How much data are people processing
>>>>> with their 20 node limit on EC2? Curious what the thoughts are...
>>>>>
>>>>> Thanks,
>>>>> Ryan