Hi Tim,

Thanks for responding -- I believe that I'll need the full power of Hadoop, since I'll want this to scale well beyond 100 GB of data. Thanks for sharing your experiences -- I'll definitely check out your blog.
Thanks!
Ryan

On Tue, Sep 2, 2008 at 8:47 AM, tim robertson <[EMAIL PROTECTED]> wrote:
> Hi Ryan,
>
> I actually blogged my experience, as it was my first usage of EC2:
> http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html
>
> My input data was not log files but actually a dump of 150 million
> records from MySQL into about 13 columns of tab-delimited data, I believe.
> It was a couple of months ago, but I remember thinking S3 was very slow...
>
> I ran some simple operations like distinct values of one column based
> on another (species within a cell) and also did some polygon analysis,
> since "is this point in this polygon" queries do not really scale too
> well in PostGIS.
>
> Incidentally, I have most of the basics of a "MapReduce-Lite" which I
> aim to port to the exact Hadoop API, since I am *only* working on
> 10s-100s of GB of data and find that it runs really well on my laptop,
> and I don't need the distributed failover. My goal for that code is
> people like me who want to know they can scale to terabyte processing,
> but don't need to take the plunge into a full Hadoop deployment yet,
> and will know they can migrate the processing in the future as things
> grow. It runs on the normal filesystem, single node only (i.e.
> multithreaded), and performs very quickly, since it is just doing Java
> NIO ByteBuffers in parallel on the underlying filesystem - on my
> laptop I Map+Sort+Combine about 130,000 records a second (simplest of
> simple map operations). For these small datasets, you might find it
> useful - let me know if I should spend time finishing it (or submit
> help?) - it is really very simple.
>
> Cheers
>
> Tim
>
>
> On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>> Hi Tim,
>>
>> Are you mostly just processing/parsing textual log files? How many
>> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
>> many did you configure in your JobConf? Just trying to get an idea of
>> what to expect in terms of performance. I'm noticing that it takes
>> about 16 minutes to transfer about 15 GB of uncompressed textual data
>> from S3 into HDFS after the cluster has started with 15 nodes. I was
>> expecting this to take less time, but maybe my assumptions are
>> incorrect. I am also noticing that it takes about 15 minutes to parse
>> through the 15 GB of data with a 15-node cluster.
>>
>> Thanks,
>> Ryan
>>
>>
>> On Tue, Sep 2, 2008 at 3:29 AM, tim robertson <[EMAIL PROTECTED]> wrote:
>>> I have been processing only 100s of GBs on EC2, not 1000s, using 20
>>> nodes, and really only in the exploration and testing phase right now.
>>>
>>>
>>> On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <[EMAIL PROTECTED]> wrote:
>>>> Hi Ryan,
>>>>
>>>> Just a heads up: if you require more than the 20-node limit, Amazon
>>>> provides a form to request a higher limit:
>>>>
>>>> http://www.amazon.com/gp/html-forms-controller/ec2-request
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>>>>> Hello all,
>>>>>
>>>>> I'm curious to see how many people are using EC2 to execute their
>>>>> Hadoop cluster and map/reduce programs, and how many are using
>>>>> home-grown datacenters. It seems like the 20-node limit with EC2 is
>>>>> a bit crippling when one wants to process many gigabytes of data.
>>>>> Has anyone found this to be the case? How much data are people
>>>>> processing with their 20-node limit on EC2? Curious what the
>>>>> thoughts are...
>>>>>
>>>>> Thanks,
>>>>> Ryan
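For anyone wondering what the JobConf settings Ryan asks about look like in code, below is a minimal sketch of a job against the old org.apache.hadoop.mapred API of that era. The class name, the choice of tab column to count, and the per-node slot counts are illustrative assumptions, not details taken from the thread; note that the map-task count is only a hint to the framework (maps are driven by input splits), while the reduce count is honoured.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LogColumnCount {

    // Map: emit the first tab-separated column of each record with a count of 1
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            String[] cols = value.toString().split("\t");
            output.collect(new Text(cols[0]), ONE);
        }
    }

    // Reduce (also used as combiner): sum the counts per key
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
                           OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            long sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(LogColumnCount.class);
        conf.setJobName("log-column-count");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // Assumed sizing for a 15-node cluster: ~4 map slots and ~2 reduce slots per node
        conf.setNumMapTasks(15 * 4);
        conf.setNumReduceTasks(15 * 2);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}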

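For comparison, here is a rough sketch of the kind of single-node, multithreaded "map + combine + sort" run that Tim describes. His actual NIO ByteBuffer implementation is not shown in the thread, so this uses plain java.util.stream parallelism over the local filesystem instead; the input path and column position are assumptions.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Stream;

public class LocalSpeciesCount {
    public static void main(String[] args) throws IOException {
        // "Map" + "combine": count occurrences of one tab-separated column,
        // parallelised across cores on the local filesystem.
        ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.parallel().forEach(line -> {
                String[] cols = line.split("\t", -1);
                String species = cols[1];  // assumed column position
                counts.computeIfAbsent(species, k -> new LongAdder()).increment();
            });
        }

        // "Sort": emit keys in order, analogous to the shuffle/sort before a reduce.
        new TreeMap<>(counts).forEach((k, v) -> System.out.println(k + "\t" + v.sum()));
    }
}

Because everything fits on one machine, the combine step is just an in-memory concurrent map; the same per-record logic could later be lifted into a Hadoop Mapper like the one above as the data grows.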