Re: Hadoop & EC2
Hi Tom,

This clears up my questions. Thanks!

Ryan

On Thu, Sep 4, 2008 at 9:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> That's because S3FileSystem stores files as 64MB blocks on S3.
> [...]
> Yes, HDFS stores files as 64MB blocks too, and map input is split by
> default so each map processes one block.
> [...]
> You could try increasing the HDFS block size. 128MB is actually
> usually a better value, for this very reason.
> [...]
> I was running 20 nodes, and each map task was handling an HDFS block, 64MB.
>
> Hope this helps,
>
> Tom
Re: Hadoop & EC2
On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
> multi-gigabyte files in ~64MB chunks.

That's because S3FileSystem stores files as 64MB blocks on S3.

> Then this is copied from S3 into HDFS using bin/hadoop distcp. Once
> the files are there and the job begins, it looks like it's breaking
> up the 4 multi-gigabyte text files into about 225 maps. Does this
> mean that each map is roughly processing 64MB of data?

Yes, HDFS stores files as 64MB blocks too, and map input is split by
default so each map processes one block.

> If so, is there any way to change this so that I can get my map tasks
> to process more data at a time? I'm curious if this will shorten the
> time it takes to run the program.

You could try increasing the HDFS block size. 128MB is actually
usually a better value, for this very reason.

In the future https://issues.apache.org/jira/browse/HADOOP-2560 will
help here too.

> Tom, in your article about Hadoop + EC2 you mention processing about
> 100GB of logs in under 6 minutes or so.

In this article:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873,
it took 35 minutes to run the job. I'm planning on doing some
benchmarking on EC2 fairly soon, which should help us improve the
performance of Hadoop on EC2. It's worth remarking that this was
running on small instances. The larger instances perform a lot better
in my experience.

> Do you remember how many EC2 instances you had running, and also how
> many map tasks did you have to operate on the 100GB? Was each map
> task handling about 1GB each?

I was running 20 nodes, and each map task was handling an HDFS block, 64MB.

Hope this helps,

Tom
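The block/split arithmetic Tom describes can be sketched with some rough
numbers. The file sizes below are illustrative assumptions (not Ryan's
actual data), chosen so that four multi-gigabyte files yield about 225
maps at the default 64MB block size; in Hadoop of this era the block
size was typically configured via the dfs.block.size property in
hadoop-site.xml.

```python
# Rough sketch: with the default input splitting, each map task
# processes one HDFS block, so the number of maps is roughly the sum of
# ceil(file_size / block_size) over the input files. File sizes here
# are made-up examples.
import math

MB = 1024 * 1024
GB = 1024 * MB

def map_count(file_sizes, block_size):
    """Approximate default number of map tasks (one per block)."""
    return sum(math.ceil(size / block_size) for size in file_sizes)

files = [3.5 * GB, 3.5 * GB, 3.5 * GB, 3.56 * GB]  # four assumed multi-GB files

print(map_count(files, 64 * MB))    # 225 maps with 64MB blocks
print(map_count(files, 128 * MB))   # 113 maps with 128MB blocks
```

Doubling the block size roughly halves the map count, which is why Tom
suggests 128MB when per-map overhead dominates.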
Re: Hadoop & EC2
I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
multi-gigabyte files in ~64MB chunks. Then this is copied from S3 into
HDFS using bin/hadoop distcp. Once the files are there and the job
begins, it looks like it's breaking up the 4 multi-gigabyte text files
into about 225 maps. Does this mean that each map is roughly processing
64MB of data?

If so, is there any way to change this so that I can get my map tasks
to process more data at a time? I'm curious if this will shorten the
time it takes to run the program.

Tom, in your article about Hadoop + EC2 you mention processing about
100GB of logs in under 6 minutes or so. Do you remember how many EC2
instances you had running, and also how many map tasks did you have to
operate on the 100GB? Was each map task handling about 1GB each?

Thanks,
Ryan

On Wed, Sep 3, 2008 at 11:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> [...]
Re: Hadoop & EC2
On Wed, Sep 3, 2008 at 3:05 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Tom,
>
> I noticed that you mentioned using Amazon's new elastic block store as
> an alternative to using S3. Right now I'm testing pushing data to S3,
> then moving it from S3 into HDFS once the Hadoop cluster is up and
> running in EC2. It works pretty well -- moving data from S3 to HDFS is
> fast when the data in S3 is broken up into multiple files, since
> bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the
> data.

Yes, this is a good-enough solution for many applications.

> Are there any real advantages to using the new elastic block store? Is
> moving data from the elastic block store into HDFS any faster than
> doing it from S3? Or can HDFS essentially live inside of the elastic
> block store?

Bandwidth between EBS and EC2 is better than between S3 and EC2, so if
you intend to run MapReduce on your data then you might consider
running an elastic Hadoop cluster that stores data on EBS-backed HDFS.
The nice thing is that you can shut down the cluster when you're not
using it and then restart it later. But if you have other applications
that need to access data from S3, then this may not be appropriate.
Also, it may not be as fast as HDFS using local disks for storage.

This is a new area, and I haven't done any measurements, so a lot of
this is conjecture on my part. Hadoop on EBS doesn't exist yet - but
it looks like a natural fit.

> Thanks!
>
> Ryan
>
> [...]
Re: Hadoop & EC2
Will do Tom... I am about to go on vacation for 3 weeks, so don't
expect anything super soon. It is nothing to get excited about but is
enough to get people into the concepts and thinking of MR and running
quickly in the IDE.

Cheers,
Tim

On Wed, Sep 3, 2008 at 3:54 PM, Tom White <[EMAIL PROTECTED]> wrote:
> There's a case study with some numbers in it from a presentation I
> gave on Hadoop and AWS in London last month, which you may find
> interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.
> [...]
> This sounds very useful. Please consider creating a Jira and
> submitting the code (even if it's not "finished" folks might like to
> see it). Thanks.
>
> Tom
Re: Hadoop & EC2
Tom,

I noticed that you mentioned using Amazon's new elastic block store as
an alternative to using S3. Right now I'm testing pushing data to S3,
then moving it from S3 into HDFS once the Hadoop cluster is up and
running in EC2. It works pretty well -- moving data from S3 to HDFS is
fast when the data in S3 is broken up into multiple files, since
bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the
data.

Are there any real advantages to using the new elastic block store? Is
moving data from the elastic block store into HDFS any faster than
doing it from S3? Or can HDFS essentially live inside of the elastic
block store?

Thanks!
Ryan

On Wed, Sep 3, 2008 at 9:54 AM, Tom White <[EMAIL PROTECTED]> wrote:
> [...]
Re: Hadoop & EC2
There's a case study with some numbers in it from a presentation I
gave on Hadoop and AWS in London last month, which you may find
interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

tim robertson <[EMAIL PROTECTED]> wrote:
> For these small datasets, you might find it useful - let me know if I
> should spend time finishing it (or submit help?) - it is really very
> simple.

This sounds very useful. Please consider creating a Jira and
submitting the code (even if it's not "finished" folks might like to
see it). Thanks.

Tom

> [...]
Re: Hadoop & EC2
I assume that Karl means 'regions' - i.e. Europe or US. I don't think
S3 has the same premise of availability zones that EC2 has.

Between different regions, data transfer is 1) charged for and 2)
likely slower between EC2 and S3-Europe. Transfer between S3-US and
EC2 is free of charge, and should be significantly quicker.

Russell

Ryan LeCompte wrote:
> How can you ensure that the S3 buckets and EC2 instances belong to a
> certain zone?
>
> Ryan
>
> On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>> [...]
>> Make sure your S3 buckets and EC2 instances are in the same zone.
Re: Hadoop & EC2
How can you ensure that the S3 buckets and EC2 instances belong to a
certain zone?

Ryan

On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
> I'm seeing much faster speeds. With 128 nodes running a mapper-only
> downloading job, downloading 30 GB takes roughly a minute, less time
> than the end-of-job work (which I assume is HDFS replication and
> bookkeeping). More mappers gives you more parallel downloads, of
> course. I'm using a Python REST client for S3, and only move data to
> or from S3 when Hadoop is done with it.
>
> Make sure your S3 buckets and EC2 instances are in the same zone.
Re: Hadoop & EC2
On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:
> I'm noticing that it takes about 16 minutes to transfer about 15GB of
> textual uncompressed data from S3 into HDFS after the cluster has
> started with 15 nodes. [...] I am also noticing that it takes about
> 15 minutes to parse through the 15GB of data with a 15 node cluster.

I'm seeing much faster speeds. With 128 nodes running a mapper-only
downloading job, downloading 30 GB takes roughly a minute, less time
than the end-of-job work (which I assume is HDFS replication and
bookkeeping). More mappers gives you more parallel downloads, of
course. I'm using a Python REST client for S3, and only move data to
or from S3 when Hadoop is done with it.

Make sure your S3 buckets and EC2 instances are in the same zone.
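Some back-of-the-envelope arithmetic on the two transfer rates reported
in this thread. The totals (15GB in ~16 minutes on 15 nodes; 30GB in
~1 minute on 128 nodes) come from the emails; the aggregate and
per-node figures are my derivation, and only rough since the reported
times are approximate.

```python
# Compare the two S3 transfer rates mentioned in the thread.
def mb_per_s(gigabytes, seconds):
    """Aggregate throughput in MB/s for a transfer of the given size."""
    return gigabytes * 1024 / seconds

ryan_total = mb_per_s(15, 16 * 60)   # Ryan: 15GB in ~16 minutes, 15 nodes
karl_total = mb_per_s(30, 60)        # Karl: 30GB in ~1 minute, 128 nodes

print(f"Ryan: {ryan_total:.0f} MB/s aggregate, {ryan_total / 15:.1f} MB/s per node")
print(f"Karl: {karl_total:.0f} MB/s aggregate, {karl_total / 128:.1f} MB/s per node")
```

Roughly 1 MB/s per node versus 4 MB/s per node, so Karl's setup is
getting several times Ryan's per-node S3 bandwidth, not just more nodes.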
Re: Hadoop & EC2
Tom White's blog has a nice piece on the different setups you can have
for a Hadoop cluster on EC2:
http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html

With the EBS volumes you can bring up and take down your cluster at
will, so you don't need to have 20 machines running all the time. We're
still collecting performance numbers, but it's definitely faster to use
EBS or local storage on EC2 than it is to use S3 (we were seeing
2Mb/s - 10Mb/s).

M

On Tue, Sep 2, 2008 at 8:59 AM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> tim robertson wrote:
>> Incidentally, I have most of the basics of a "MapReduce-Lite" which I
>> aim to port to use the exact Hadoop API since I am *only* working on
>> 10's-100's GB of data and find that it is running really fine on my
>> laptop and I don't need the distributed failover.
>
> If it's going to be API-compatible with regular Hadoop, then I'm sure
> many people will find it useful. E.g. many Nutch users bemoan the
> complexity of distributed Hadoop setup, and they are not satisfied
> with the "local" single-threaded physical-copy execution mode.
Re: Hadoop & EC2
tim robertson wrote:
> Incidentally, I have most of the basics of a "MapReduce-Lite" which I
> aim to port to use the exact Hadoop API since I am *only* working on
> 10's-100's GB of data and find that it is running really fine on my
> laptop and I don't need the distributed failover.

If it's going to be API-compatible with regular Hadoop, then I'm sure
many people will find it useful. E.g. many Nutch users bemoan the
complexity of distributed Hadoop setup, and they are not satisfied with
the "local" single-threaded physical-copy execution mode.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Hadoop & EC2
Hi Tim,

Thanks for responding -- I believe that I'll need the full power of
Hadoop since I'll want this to scale well beyond 100GB of data. Thanks
for sharing your experiences -- I'll definitely check out your blog.

Thanks!
Ryan

On Tue, Sep 2, 2008 at 8:47 AM, tim robertson <[EMAIL PROTECTED]> wrote:
> [...]
Re: Hadoop & EC2
Hi Ryan,

I actually blogged my experience, as it was my first usage of EC2:
http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

My input data was not log files but a dump of 150 million records from
MySQL into about 13 columns of tab-delimited data, I believe. It was a
couple of months ago, but I remember thinking S3 was very slow...

I ran some simple operations like distinct values of one column based
on another (species within a cell) and also did some polygon analysis,
since "is this point in this polygon" does not really scale too well
in PostGIS.

Incidentally, I have most of the basics of a "MapReduce-Lite" which I
aim to port to the exact Hadoop API, since I am *only* working on
10's-100's GB of data and find that it runs really fine on my laptop,
and I don't need the distributed failover. My goal for that code is
for people like me who want to know they can scale to terabyte
processing, but don't need to take the plunge into a full Hadoop
deployment yet, knowing they can migrate the processing in the future
as things grow. It runs on the normal filesystem, single node only
(i.e. multithreaded), and performs very quickly since it is just doing
Java NIO ByteBuffers in parallel on the underlying filesystem - on my
laptop I Map+Sort+Combine about 130,000 jobs a second (the simplest of
simple map operations). For these small datasets, you might find it
useful - let me know if I should spend time finishing it (or submit
help?) - it is really very simple.

Cheers,
Tim

On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hi Tim,
>
> Are you mostly just processing/parsing textual log files? How many
> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
> many did you configure in your JobConf? Just trying to get an idea of
> what to expect in terms of performance. I'm noticing that it takes
> about 16 minutes to transfer about 15GB of textual uncompressed data
> from S3 into HDFS after the cluster has started with 15 nodes. I was
> expecting this to take a shorter amount of time, but maybe I'm
> incorrect in my assumptions. I am also noticing that it takes about 15
> minutes to parse through the 15GB of data with a 15 node cluster.
>
> Thanks,
> Ryan
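Tim's "distinct values of one column based on another (species within a cell)" operation maps naturally onto a tiny single-node sketch in plain Java. This is illustrative only, not his MapReduce-Lite code; the class name, column indices, and sample data are all made up:

```java
import java.util.*;
import java.util.stream.*;

public class DistinctPerCell {
    // Count the distinct values of one tab-separated column (valCol)
    // per value of another column (keyCol), e.g. species per cell.
    public static Map<String, Long> distinctPerKey(Stream<String> lines,
                                                   int keyCol, int valCol) {
        return lines.map(l -> l.split("\t", -1))
                    // skip malformed rows that lack the needed columns
                    .filter(f -> f.length > Math.max(keyCol, valCol))
                    .map(f -> Map.entry(f[keyCol], f[valCol]))
                    .distinct()                       // unique (key, value) pairs
                    .collect(Collectors.groupingBy(Map.Entry::getKey,
                                                   Collectors.counting()));
    }

    public static void main(String[] args) {
        Stream<String> sample = Stream.of(
            "cell1\tspeciesA",
            "cell1\tspeciesA",   // duplicate observation, counted once
            "cell1\tspeciesB",
            "cell2\tspeciesA");
        // Two distinct species in cell1, one in cell2.
        System.out.println(distinctPerKey(sample, 0, 1));
    }
}
```

The same map (emit key/value), sort, and reduce (count distinct) shape is what the full Hadoop job would do, just spread across nodes.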
Re: Hadoop & EC2
Hi Tim,

Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance.

I'm noticing that it takes about 16 minutes to transfer about 15GB of
textual uncompressed data from S3 into HDFS after the cluster has
started with 15 nodes. I was expecting this to take a shorter amount
of time, but maybe I'm incorrect in my assumptions. I am also noticing
that it takes about 15 minutes to parse through the 15GB of data with
a 15 node cluster.

Thanks,
Ryan

On Tue, Sep 2, 2008 at 3:29 AM, tim robertson <[EMAIL PROTECTED]> wrote:
> I have been processing only 100s GBs on EC2, not 1000's and using 20
> nodes and really only in exploration and testing phase right now.
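For reference, the task counts Ryan asks about are usually set in the Hadoop configuration rather than in the EC2 scripts themselves. A sketch of a hadoop-site.xml fragment using the 0.18-era property names; the values are made-up examples, not recommendations:

```xml
<!-- Illustrative fragment only; property names are the 0.18-era ones
     and every value here is an assumed example. -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>          <!-- concurrent map slots per node -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>15</value>         <!-- reduces for the whole job -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>  <!-- 128MB blocks, so fewer, larger maps -->
  </property>
</configuration>
```

The dfs.block.size entry ties into the block-size discussion above: with 128MB blocks, the same input produces roughly half as many map tasks.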
Re: Hadoop & EC2
I have been processing only 100s of GBs on EC2, not 1000's, using 20
nodes, and I'm really only in the exploration and testing phase right
now.

On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <[EMAIL PROTECTED]> wrote:
> Hi Ryan,
>
> Just a heads up, if you require more than the 20 node limit, Amazon
> provides a form to request a higher limit:
>
> http://www.amazon.com/gp/html-forms-controller/ec2-request
>
> Andrew
Re: Hadoop & EC2
Hi Ryan,

Just a heads up: if you require more than the 20 node limit, Amazon
provides a form to request a higher limit:

http://www.amazon.com/gp/html-forms-controller/ec2-request

Andrew

On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I'm curious to see how many people are using EC2 to execute their
> Hadoop cluster and map/reduce programs, and how many are using
> home-grown datacenters. It seems like the 20 node limit with EC2 is a
> bit crippling when one wants to process many gigabytes of data. Has
> anyone found this to be the case? How much data are people processing
> with their 20 node limit on EC2? Curious what the thoughts are...
>
> Thanks,
> Ryan
Hadoop & EC2
Hello all,

I'm curious to see how many people are using EC2 to execute their
Hadoop cluster and map/reduce programs, and how many are using
home-grown datacenters. It seems like the 20 node limit with EC2 is a
bit crippling when one wants to process many gigabytes of data. Has
anyone found this to be the case? How much data are people processing
with their 20 node limit on EC2? Curious what the thoughts are...

Thanks,
Ryan
hadoop-ec2 log access
I'm unable to access my logs with the JobTracker/TaskTracker web
interface for a Hadoop job running on Amazon EC2. The URLs given for
the task logs are of the form:

http://domu-[...].compute-1.internal:50060/

The Hadoop-EC2 docs suggest that I should be able to get onto port
50060 on the master and the task boxes. Is there a way to reach the
logs, maybe by finding out what IP address to use? Or is there a way
to see the logs on the master? When I run pseudo-distributed, the logs
show up in the logs/userlogs subdirectory of the Hadoop root, but I
can't find them on my EC2 instances. I'm running a streaming job, so I
need to be able to look at the stderr of my tasks.

Thanks for any help.
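A couple of workarounds commonly used with the EC2 scripts of this era are sketched below. These commands are hedged assumptions, not from the thread: the security group name, keypair path, and log directory are guesses about a typical setup, and the `<...>` hostnames are placeholders you must fill in.

```shell
# Option 1 (assumed group name): open the web UI ports in the EC2
# security group the cluster runs in, then browse to the nodes'
# public hostnames instead of the *.compute-1.internal ones.
ec2-authorize my-hadoop-group -p 50030   # JobTracker web UI
ec2-authorize my-hadoop-group -p 50060   # TaskTracker web UI

# Option 2: tunnel a SOCKS proxy through the master and point the
# browser's SOCKS setting at localhost:6666, so that the internal
# compute-1.internal hostnames in the links resolve.
ssh -i ~/.ssh/id_rsa-gsg-keypair -D 6666 root@<master-public-dns>

# Option 3: read the task logs directly on a tasktracker node
# (the log path is an assumption about the AMI's layout).
ssh -i ~/.ssh/id_rsa-gsg-keypair root@<node-public-dns> \
    'ls /mnt/hadoop/logs/userlogs'
```

For a streaming job, each task's stderr ends up under that per-task userlogs directory on whichever node ran the task.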