Re: Hadoop EC2

2008-09-04 Thread Tom White
On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
 multi-gigabyte files in ~64MB chunks.

That's because S3Filesystem stores files as 64MB blocks on S3.

 Then this is copied from S3 into HDFS using bin/hadoop distcp. Once
 the files are there and the job begins, it looks like it's breaking up
 the 4 multi-gigabyte text files into about 225 maps. Does this mean
 that each map is roughly processing 64MB of data?

Yes, HDFS stores files as 64MB blocks too, and map input is split by
default so each map processes one block.

 If so, is there any way to change this
 so that I can get my map tasks to process more data at a time? I'm
 curious if this will shorten the time it takes to run the program.

You could try increasing the HDFS block size. 128MB is usually a
better value, for this very reason.

In the future https://issues.apache.org/jira/browse/HADOOP-2560 will
help here too.
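
A rough, untested sketch of what I mean, using the old JobConf API. The
property names (dfs.block.size, mapred.min.split.size) are the ones I
remember from the 0.18-era configuration, so check them against your
version; the helper class is purely illustrative.

    import org.apache.hadoop.mapred.JobConf;

    // Untested sketch: make each map task cover more input data.
    public class WiderSplits {
      public static JobConf widen(JobConf conf) {
        // Files the job writes to HDFS will be created with 128MB blocks.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        // For input files already stored as 64MB blocks, ask FileInputFormat
        // to pack at least 128MB into each split (at some cost in locality,
        // since a split may then span two blocks).
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
        return conf;
      }
    }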


 Tom, in your article about Hadoop + EC2 you mention processing about
 100GB of logs in under 6 minutes or so.

In this article:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873,
it took 35 minutes to run the job. I'm planning on doing some
benchmarking on EC2 fairly soon, which should help us improve the
performance of Hadoop on EC2. It's worth remarking that this was
running on small instances. The larger instances perform a lot better
in my experience.

 Do you remember how many EC2
 instances you had running, and also how many map tasks you had
 operating on the 100GB? Was each map task handling about 1GB?

I was running 20 nodes, and each map task was handling an HDFS block, i.e. 64MB.
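(At 64MB per block, 100GB works out to roughly 1,600 map tasks, or about
80 per node on a 20-node cluster.)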

Hope this helps,

Tom


Re: Hadoop EC2

2008-09-04 Thread Ryan LeCompte
Hi Tom,

This clears up my questions.

Thanks!

Ryan



On Thu, Sep 4, 2008 at 9:21 AM, Tom White [EMAIL PROTECTED] wrote:
 On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
 multi-gigabyte files in ~64MB chunks.

 That's because S3Filesystem stores files as 64MB blocks on S3.

 Then, when this is copied from
 S3 into HDFS using bin/hadoop distcp. Once the files are there and the
 job begins, it looks like it's breaking up the 4 multigigabyte text
 files into about 225 maps. Does this mean that each map is roughly
 processing 64MB of data each?

 Yes, HDFS stores files as 64MB blocks too, and map input is split by
 default so each map processes one block.

If so, is there any way to change this
 so that I can get my map tasks to process more data at a time? I'm
 curious if this will shorten the time it takes to run the program.

 You could try increasing the HDFS block size. 128MB is actually
 usually a better value, for this very reason.

 In the future https://issues.apache.org/jira/browse/HADOOP-2560 will
 help here too.


 Tom, in your article about Hadoop + EC2 you mention processing about
 100GB of logs in under 6 minutes or so.

 In this article:
 http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873,
 it took 35 minutes to run the job. I'm planning on doing some
 benchmarking on EC2 fairly soon, which should help us improve the
 performance of Hadoop on EC2. It's worth remarking that this was
 running on small instances. The larger instances perform a lot better
 in my experience.

 Do you remember how many EC2
 instances you had running, and also how many map tasks did you have to
 operate on the 100GB? Was each map task handling about 1GB each?

 I was running 20 nodes, and each map task was handling a HDFS block, 64MB.

 Hope this helps,

 Tom



Re: Hadoop EC2

2008-09-03 Thread Tom White
There's a case study with some numbers in it from a presentation I
gave on Hadoop and AWS in London last month, which you may find
interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

tim robertson [EMAIL PROTECTED] wrote:
 For these small
 datasets, you might find it useful - let me know if I should spend
 time finishing it (Or submit help?) - it is really very simple.

This sounds very useful. Please consider creating a Jira and
submitting the code (even if it's not finished, folks might like to
see it). Thanks.

Tom


 Cheers

 Tim



 On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan







Re: Hadoop EC2

2008-09-03 Thread Ryan LeCompte
Tom,

I noticed that you mentioned using Amazon's new elastic block store as
an alternative to using S3. Right now I'm testing pushing data to S3,
then moving it from S3 into HDFS once the Hadoop cluster is up and
running in EC2. It works pretty well -- moving data from S3 to HDFS is
fast when the data in S3 is broken up into multiple files, since
bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the
data.

Are there any real advantages to using the new elastic block store? Is
moving data from the elastic block store into HDFS any faster than
doing it from S3? Or can HDFS essentially live inside of the elastic
block store?

Thanks!

Ryan


On Wed, Sep 3, 2008 at 9:54 AM, Tom White [EMAIL PROTECTED] wrote:
 There's a case study with some numbers in it from a presentation I
 gave on Hadoop and AWS in London last month, which you may find
 interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

 tim robertson [EMAIL PROTECTED] wrote:
 For these small
 datasets, you might find it useful - let me know if I should spend
 time finishing it (Or submit help?) - it is really very simple.

 This sounds very useful. Please consider creating a Jira and
 submitting the code (even if it's not finished folks might like to
 see it). Thanks.

 Tom


 Cheers

 Tim



 On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan








Re: Hadoop EC2

2008-09-03 Thread tim robertson
Will do Tom... I am about to go on vacation for 3 weeks, so don't
expect anything super soon.
It is nothing to get too excited about, but it is enough to get people
into the concepts and thinking of MR, and running quickly in the IDE.

Cheers

Tim

On Wed, Sep 3, 2008 at 3:54 PM, Tom White [EMAIL PROTECTED] wrote:
 There's a case study with some numbers in it from a presentation I
 gave on Hadoop and AWS in London last month, which you may find
 interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

 tim robertson [EMAIL PROTECTED] wrote:
 For these small
 datasets, you might find it useful - let me know if I should spend
 time finishing it (Or submit help?) - it is really very simple.

 This sounds very useful. Please consider creating a Jira and
 submitting the code (even if it's not finished folks might like to
 see it). Thanks.

 Tom


 Cheers

 Tim



 On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan








Re: Hadoop EC2

2008-09-03 Thread Tom White
On Wed, Sep 3, 2008 at 3:05 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Tom,

 I noticed that you mentioned using Amazon's new elastic block store as
 an alternative to using S3. Right now I'm testing pushing data to S3,
 then moving it from S3 into HDFS once the Hadoop cluster is up and
 running in EC2. It works pretty well -- moving data from S3 to HDFS is
 fast when the data in S3 is broken up into multiple files, since
 bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the
 data.

Yes, this is a good-enough solution for many applications.


 Are there any real advantages to using the new elastic block store? Is
 moving data from the elastic block store into HDFS any faster than
 doing it from S3? Or can HDFS essentially live inside of the elastic
 block store?

Bandwidth between EBS and EC2 is better than between S3 and EC2, so if
you intend to run MapReduce on your data then you might consider
running an elastic Hadoop cluster that stores data on EBS-backed HDFS.
The nice thing is that you can shut down the cluster when you're not
using it and then restart it later. But if you have other applications
that need to access data from S3, then this may not be appropriate.
Also, it may not be as fast as HDFS using local disks for storage.

This is a new area, and I haven't done any measurements, so a lot of
this is conjecture on my part. Hadoop on EBS doesn't exist yet - but
it looks like a natural fit.


 Thanks!

 Ryan


 On Wed, Sep 3, 2008 at 9:54 AM, Tom White [EMAIL PROTECTED] wrote:
 There's a case study with some numbers in it from a presentation I
 gave on Hadoop and AWS in London last month, which you may find
 interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

 tim robertson [EMAIL PROTECTED] wrote:
 For these small
 datasets, you might find it useful - let me know if I should spend
 time finishing it (Or submit help?) - it is really very simple.

 This sounds very useful. Please consider creating a Jira and
 submitting the code (even if it's not finished folks might like to
 see it). Thanks.

 Tom


 Cheers

 Tim



 On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] 
 wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan









Re: Hadoop EC2

2008-09-02 Thread Andrew Hitchcock
Hi Ryan,

Just a heads up, if you require more than the 20 node limit, Amazon
provides a form to request a higher limit:

http://www.amazon.com/gp/html-forms-controller/ec2-request

Andrew

On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan



Re: Hadoop EC2

2008-09-02 Thread tim robertson
I have been processing only 100s of GBs on EC2, not 1000's, using 20
nodes, and really only in the exploration and testing phase right now.


On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan




Re: Hadoop EC2

2008-09-02 Thread Ryan LeCompte
Hi Tim,

Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance. I'm noticing that it takes
about 16 minutes to transfer about 15GB of textual uncompressed data
from S3 into HDFS after the cluster has started with 15 nodes. I was
expecting this to take a shorter amount of time, but maybe I'm
incorrect in my assumptions. I am also noticing that it takes about 15
minutes to parse through the 15GB of data with a 15 node cluster.
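
For concreteness, the kind of JobConf settings I'm asking about look
something like the sketch below (old mapred API; the class name, paths,
and task counts are made-up examples, not my actual job):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    // Illustrative driver settings only; names and numbers are invented.
    public class LogParseDriver {
      public static JobConf configure() {
        JobConf conf = new JobConf(LogParseDriver.class);
        conf.setJobName("parse-logs");
        FileInputFormat.setInputPaths(conf, new Path("/user/root/logs"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/root/parsed"));
        // The number of maps is only a hint; the framework normally creates
        // one map per input split (one per HDFS block by default).
        conf.setNumMapTasks(60);
        // The number of reduces is honoured exactly; a common rule of thumb
        // is a small multiple of the number of nodes.
        conf.setNumReduceTasks(15);
        return conf;
      }
    }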

Thanks,
Ryan


On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan





Re: Hadoop EC2

2008-09-02 Thread tim robertson
Hi Ryan,

I actually blogged my experience as it was my first usage of EC2:
http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

My input data was not log files but actually a dump of 150 million
records from MySQL into about 13 columns of tab-delimited data, I believe.
It was a couple of months ago, but I remember thinking S3 was very slow...

I ran some simple operations like distinct values of one column based
on another (species within a cell) and also did some polygon analysis,
since asking "is this point in this polygon?" does not really scale too
well in PostGIS.
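
For what it's worth, that kind of "distinct values" job is only a few
lines in the old mapred API. This is just a sketch, not my actual code;
it assumes tab-separated rows with the cell id in column 0 and the
species name in column 1:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Distinct species per cell: emit a composite "cell<TAB>species" key and
    // let the reduce collapse duplicates. The reducer can double as combiner.
    public class DistinctSpeciesPerCell {

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, NullWritable> {
        public void map(LongWritable offset, Text row,
                        OutputCollector<Text, NullWritable> out,
                        Reporter reporter) throws IOException {
          String[] cols = row.toString().split("\t");
          if (cols.length > 1) {
            out.collect(new Text(cols[0] + "\t" + cols[1]), NullWritable.get());
          }
        }
      }

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, NullWritable, Text, NullWritable> {
        public void reduce(Text cellAndSpecies, Iterator<NullWritable> values,
                           OutputCollector<Text, NullWritable> out,
                           Reporter reporter) throws IOException {
          // Each distinct (cell, species) pair comes through exactly once.
          out.collect(cellAndSpecies, NullWritable.get());
        }
      }
    }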

Incidentally, I have most of the basics of a MapReduce-Lite which I
aim to port to use the exact Hadoop API, since I am *only* working on
10's-100's of GB of data and find that it runs really fine on my
laptop and I don't need the distributed failover. My goal for that
code is for people like me who want to know that they can scale to
terabyte processing, but don't need to take the plunge to a full Hadoop
deployment yet, and will know that they can migrate the processing in
the future as things grow. It runs on the normal filesystem, on a
single node only (i.e. multithreaded), and performs very quickly since
it is just using Java NIO ByteBuffers in parallel on the underlying
filesystem - on my laptop I Map+Sort+Combine about 130,000 map
operations a second (simplest of simple map operations). For these
small datasets, you might find it useful - let me know if I should
spend time finishing it (or submit it for help?) - it is really very
simple.

Cheers

Tim



On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan






Re: Hadoop EC2

2008-09-02 Thread Ryan LeCompte
Hi Tim,

Thanks for responding -- I believe that I'll need the full power of
Hadoop since I'll want this to scale well beyond 100GB of data. Thanks
for sharing your experiences -- I'll definitely check out your blog.

Thanks!

Ryan


On Tue, Sep 2, 2008 at 8:47 AM, tim robertson [EMAIL PROTECTED] wrote:
 Hi Ryan,

 I actually blogged my experience as it was my first usage of EC2:
 http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

 My input data was not log files but actually a dump if 150million
 records from Mysql into about 13 columns of tab file data I believe.
 It was a couple of months ago, but I remember thinking S3 was very slow...

 I ran some simple operations like distinct values of one column based
 on another (species within a cell) and also did some Polygon analysis
 since to do is this point in this polygon does not really scale too
 well in PostGIS.

 Incidentally, I have most of the basics of a MapReduce-Lite which I
 aim to port to use the exact Hadoop API since I am *only* working on
 10's-100's GB of data and find that it is running really fine on my
 laptop and I don't need the distributed failover.  My goal for that
 code is for people like me who want to know that I can scale to
 terrabyte processing, but don't need to take the plunge to full Hadoop
 deployment yet, but will know that I can migrate the processing in the
 future as  things grow.  It runs on the normal filesystem, and single
 node only (e.g. multithreaded), and performs very quickly since it is
 just doing java NIO bytebuffers in parallel on the underlying
 filesystem - on my laptop I Map+Sort+Combine about 130,000 jobs a
 seconds (simplest of simple map operations).  For these small
 datasets, you might find it useful - let me know if I should spend
 time finishing it (Or submit help?) - it is really very simple.

 Cheers

 Tim



 On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan







Re: Hadoop EC2

2008-09-02 Thread Andrzej Bialecki

tim robertson wrote:


Incidentally, I have most of the basics of a MapReduce-Lite which I
aim to port to use the exact Hadoop API since I am *only* working on
10's-100's GB of data and find that it is running really fine on my
laptop and I don't need the distributed failover.  My goal for that


If it's going to be API-compatible with regular Hadoop, then I'm sure 
many people will find it useful. E.g. many Nutch users bemoan the 
complexity of distributed Hadoop setup, and they are not satisfied with 
the local single-threaded physical-copy execution mode.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Hadoop EC2

2008-09-02 Thread Michael Stoppelman
Tom White's blog has a nice piece on the different setups you can have for a
Hadoop cluster on EC2:
http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html

With the EBS volumes you can bring up and take down your cluster at will,
so you don't need to have 20 machines running all the time. We're still
collecting performance numbers, but it's definitely faster to use EBS or
local storage on EC2 than it is to use S3 (we were seeing 2Mb/s - 10Mb/s
from S3).

M

On Tue, Sep 2, 2008 at 8:59 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 tim robertson wrote:

  Incidentally, I have most of the basics of a MapReduce-Lite which I
 aim to port to use the exact Hadoop API since I am *only* working on
 10's-100's GB of data and find that it is running really fine on my
 laptop and I don't need the distributed failover.  My goal for that


 If it's going to be API-compatible with regular Hadoop, then I'm sure many
 people will find it useful. E.g. many Nutch users bemoan the complexity of
 distributed Hadoop setup, and they are not satisfied with the local
 single-threaded physical-copy execution mode.


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




Re: Hadoop EC2

2008-09-02 Thread Ryan LeCompte
How can you ensure that the S3 buckets and EC2 instances belong to a
certain zone?

Ryan


On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson [EMAIL PROTECTED] wrote:

 On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:

 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 I'm seeing much faster speeds. With 128 nodes running a mapper-only
 downloading job, downloading 30 GB takes roughly a minute, less time than
 the end-of-job work (which I assume is HDFS replication and bookkeeping).
 More mappers give you more parallel downloads, of course. I'm using a
 Python REST client for S3, and only move data to or from S3 when Hadoop is
 done with it.

 Make sure your S3 buckets and EC2 instances are in the same zone.
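
For reference, a mapper-only copy job in the spirit of what Karl
describes might be sketched as below. Karl uses a Python REST client;
this sketch instead leans on Hadoop's own S3 FileSystem, every class
name and path in it is invented, and it assumes the S3 credentials are
already in the job configuration. The job's input is a small text file
listing one s3:// path per line, and the driver should call
conf.setNumReduceTasks(0) so no reduce phase runs.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Each map call receives one line of input (an s3:// path) and copies
    // that object into HDFS, so the copies run in parallel across mappers.
    public class S3FetchMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private JobConf conf;

      public void configure(JobConf job) {
        this.conf = job;
      }

      public void map(LongWritable offset, Text s3Path,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        Path src = new Path(s3Path.toString());  // e.g. s3://some-bucket/logs/part-0001
        Path dst = new Path("/user/root/staged", src.getName());  // invented target dir
        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);
        // false = don't delete the source after copying.
        FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
        out.collect(s3Path, new Text("copied to " + dst));
      }
    }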




Re: Hadoop EC2

2008-09-02 Thread Russell Smith
I assume that Karl means 'regions' - i.e. Europe or US. I don't think S3
has the same notion of availability zones that EC2 has.


Between different regions (e.g. EC2 in the US and S3-Europe), data
transfer is 1) charged for and 2) likely slower.


Transfer between S3-US and EC2 is free of charge, and should be 
significantly quicker.



Russell

Ryan LeCompte wrote:

How can you ensure that the S3 buckets and EC2 instances belong to a
certain zone?

Ryan


On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson [EMAIL PROTECTED] wrote:
  

On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:



Hi Tim,

Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance. I'm noticing that it takes
about 16 minutes to transfer about 15GB of textual uncompressed data
from S3 into HDFS after the cluster has started with 15 nodes. I was
expecting this to take a shorter amount of time, but maybe I'm
incorrect in my assumptions. I am also noticing that it takes about 15
minutes to parse through the 15GB of data with a 15 node cluster.
  

I'm seeing much faster speeds.  With 128 nodes running a mapper-only
downloading job, downloading 30 GB takes roughly a minute, less time than
the end of job work (which I assume is HDFS replication and bookkeeping).
 More mappers gives you more parallel downloads, of course.  I'm using a
Python REST client for S3, and only move data to or from S3 when Hadoop is
done with it.

Make sure your S3 buckets and EC2 instances are in the same zone.







Hadoop EC2

2008-09-01 Thread Ryan LeCompte
Hello all,

I'm curious to see how many people are using EC2 to execute their
Hadoop cluster and map/reduce programs, and how many are using
home-grown datacenters. It seems like the 20 node limit with EC2 is a
bit crippling when one wants to process many gigabytes of data. Has
anyone found this to be the case? How much data are people processing
with their 20 node limit on EC2? Curious what the thoughts are...

Thanks,
Ryan


hadoop-ec2 log access

2008-07-21 Thread Karl Anderson
I'm unable to access my logs with the JobTracker/TaskTracker web  
interface for a Hadoop job running on Amazon EC2.  The URLs given for  
the task logs are of the form:


  http://domu-[...].compute-1.internal:50060/

The Hadoop-EC2 docs suggest that I should be able to get onto port
50060 for the master and the task boxes; is there a way to reach the
logs? Maybe by finding out what IP address to use? Or is there a way
to see the logs on the master?  When I run pseudo-distributed, the  
logs show up in the logs/userlogs subdirectory of the Hadoop root, but  
not on my EC2 instances.


I'm running a streaming job, so I need to be able to look at the  
stderr of my tasks.


Thanks for any help.