Hi Tim,

Thanks for responding -- I believe that I'll need the full power of Hadoop, since I'll want this to scale well beyond 100 GB of data. Thanks for sharing your experiences -- I'll definitely check out your blog.
Thanks!
Ryan

On Tue, Sep 2, 2008 at 8:47 AM, tim robertson <[EMAIL PROTECTED]> wrote:
> Hi Ryan,
>
> I actually blogged my experience, as it was my first usage of EC2:
> http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html
>
> My input data was not log files but actually a dump of 150 million
> records from MySQL into about 13 columns of tab-delimited data, I believe.
> It was a couple of months ago, but I remember thinking S3 was very slow...
>
> I ran some simple operations like distinct values of one column based
> on another (species within a cell) and also did some polygon analysis,
> since "is this point in this polygon" queries do not really scale too
> well in PostGIS.
>
> Incidentally, I have most of the basics of a "MapReduce-Lite" which I
> aim to port to the exact Hadoop API, since I am *only* working on
> 10s-100s of GB of data and find that it runs really well on my laptop,
> and I don't need the distributed failover. My goal for that code is
> people like me who want to know they can scale to terabyte processing,
> but don't need to take the plunge into a full Hadoop deployment yet,
> and will know they can migrate the processing in the future as things
> grow. It runs on the normal filesystem, single node only (i.e.
> multithreaded), and performs very quickly, since it is just doing Java
> NIO ByteBuffers in parallel on the underlying filesystem - on my
> laptop I Map+Sort+Combine about 130,000 records a second (simplest of
> simple map operations). For these small datasets, you might find it
> useful - let me know if I should spend time finishing it (or submit
> help?) - it is really very simple.
>
> Cheers
>
> Tim
>
>
> On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>> Hi Tim,
>>
>> Are you mostly just processing/parsing textual log files? How many
>> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
>> many did you configure in your JobConf? Just trying to get an idea of
>> what to expect in terms of performance. I'm noticing that it takes
>> about 16 minutes to transfer about 15 GB of uncompressed textual data
>> from S3 into HDFS after the cluster has started with 15 nodes. I was
>> expecting this to take less time, but maybe my assumptions are
>> incorrect. I am also noticing that it takes about 15 minutes to parse
>> through the 15 GB of data with a 15-node cluster.
>>
>> Thanks,
>> Ryan
>>
>>
>> On Tue, Sep 2, 2008 at 3:29 AM, tim robertson <[EMAIL PROTECTED]> wrote:
>>> I have been processing only 100s of GBs on EC2, not 1000s, using 20
>>> nodes, and really only in the exploration and testing phase right now.
>>>
>>>
>>> On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <[EMAIL PROTECTED]> wrote:
>>>> Hi Ryan,
>>>>
>>>> Just a heads up: if you require more than the 20-node limit, Amazon
>>>> provides a form to request a higher limit:
>>>>
>>>> http://www.amazon.com/gp/html-forms-controller/ec2-request
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>>>>> Hello all,
>>>>>
>>>>> I'm curious to see how many people are using EC2 to execute their
>>>>> Hadoop cluster and map/reduce programs, and how many are using
>>>>> home-grown datacenters. It seems like the 20-node limit with EC2 is
>>>>> a bit crippling when one wants to process many gigabytes of data.
>>>>> Has anyone found this to be the case? How much data are people
>>>>> processing with their 20-node limit on EC2? Curious what the
>>>>> thoughts are...
>>>>>
>>>>> Thanks,
>>>>> Ryan
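For anyone wondering what the JobConf settings Ryan asks about look like in code, below is a minimal sketch of a job against the old org.apache.hadoop.mapred API of that era. The class name, the choice of tab column to count, and the per-node slot counts are illustrative assumptions, not details taken from the thread; note that the map-task count is only a hint to the framework (maps are driven by input splits), while the reduce count is honoured.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LogColumnCount {

    // Map: emit the first tab-separated column of each record with a count of 1
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            String[] cols = value.toString().split("\t");
            output.collect(new Text(cols[0]), ONE);
        }
    }

    // Reduce (also used as combiner): sum the counts per key
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            long sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(LogColumnCount.class);
        conf.setJobName("log-column-count");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // Assumed sizing for a 15-node cluster: ~4 map slots and ~2 reduce slots per node
        conf.setNumMapTasks(15 * 4);
        conf.setNumReduceTasks(15 * 2);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}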

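For comparison, here is a rough sketch of the kind of single-node, multithreaded "map + combine + sort" run that Tim describes. His actual NIO ByteBuffer implementation is not shown in the thread, so this uses plain java.util.stream parallelism over the local filesystem instead; the input path and column position are assumptions.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Stream;

public class LocalSpeciesCount {
    public static void main(String[] args) throws IOException {
        // "Map" + "combine": count occurrences of one tab-separated column,
        // parallelised across cores on the local filesystem.
        ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.parallel().forEach(line -> {
                String[] cols = line.split("\t", -1);
                String species = cols[1];  // assumed column position
                counts.computeIfAbsent(species, k -> new LongAdder()).increment();
            });
        }

        // "Sort": emit keys in order, analogous to the shuffle/sort before a reduce.
        new TreeMap<>(counts).forEach((k, v) -> System.out.println(k + "\t" + v.sum()));
    }
}

Because everything fits on one machine, the combine step is just an in-memory concurrent map; the same per-record logic could later be lifted into a Hadoop Mapper like the one above as the data grows.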