Re: Hadoop/Elastic MR on AWS

2010-12-29 Thread Sudhir Vallamkondu
 Are there any independent sites that collect cloud uptime numbers?
Not that I know of.

If you look at the full post content, people have raised quite a few pros and
cons. We are analyzing the AWS CloudWatch API to see how we can leverage it
to monitor EMR. EMR is offered in many of their regions, and since we are
planning on using S3 as the raw data store, if one region is experiencing
problems we can always look into killing the job and starting it off in another
region. Just a thought.
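
A minimal sketch of that fallback idea, assuming the health check and the job launcher are supplied by the caller. The function name, the callables and the region list are all hypothetical; nothing here calls the real CloudWatch or EMR APIs.

```python
from typing import Callable, Optional, Sequence


def run_with_region_fallback(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
    run_job: Callable[[str], bool],
) -> Optional[str]:
    """Try regions in order and return the first one where the job succeeded.

    is_healthy and run_job are hypothetical hooks: the first would wrap
    whatever monitoring is available, the second would start the job and
    report whether it finished.
    """
    for region in regions:
        if not is_healthy(region):
            continue  # skip regions that look unhealthy
        if run_job(region):
            return region
    return None  # no region could run the job


# Stand-in callables for illustration only.
def fake_health(region: str) -> bool:
    return region != "us-east-1"


def fake_run(region: str) -> bool:
    return True


print(run_with_region_fallback(["us-east-1", "us-west-1"], fake_health, fake_run))
```

Because the raw data sits in S3, the job can in principle be restarted elsewhere, but cross-region data transfer time and cost would still need to be factored in.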

http://lucene.472066.n3.nabble.com/Hadoop-Elastic-MR-on-AWS-td2058471.html


On 12/28/10 8:01 PM, common-user-digest-h...@hadoop.apache.org
common-user-digest-h...@hadoop.apache.org wrote:

 From: Lance Norskog goks...@gmail.com
 Date: Tue, 28 Dec 2010 18:50:14 -0800
 To: common-user@hadoop.apache.org
 Subject: Re: Hadoop/Elastic MR on AWS
 
 Cloud providers have more uptime problems than dedicated servers. And
 it is impossible to benchmark: virtual server implementations do not
 apply quotas to I/O. I've seen the same 'instance size' have 5x deltas
 in disk bandwidth from one day to the next.
 
 Are there any independent sites that collect cloud uptime numbers?
 

Re: Hadoop/Elastic MR on AWS

2010-12-28 Thread Sudhir Vallamkondu
Unfortunately I can't publish the exact numbers; however, here are the various
things we considered.

First off, our data trends. We gathered our current data size and plotted a
future growth trend for the next few years. We then settled on an archival
strategy to understand how much data needs to be on the cluster on a
rotating basis. We crunch our data often (meaning as we get it), so
computing power is not an issue, and the cluster size was mainly driven by
the amount of data that needs to be readily available and by our replication
strategy. We factored in compression for older rotating data.

Once we had the above numbers we could decide on our cluster infrastructure
size and type of hardware needed.
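
As a rough illustration of that sizing step, here is a sketch that projects data growth and turns it into a node count. All names and figures (growth rate, compression ratio, replication factor, usable disk per node) are made-up placeholders, not the numbers from our study.

```python
import math


def project_size(current_tb: float, annual_growth: float, years: int) -> float:
    """Compound growth: annual_growth=1.0 means the data doubles each year."""
    return current_tb * (1.0 + annual_growth) ** years


def nodes_needed(hot_tb: float, cold_tb: float, compression_ratio: float,
                 replication: int, usable_tb_per_node: float) -> int:
    """Nodes required to hold hot data plus compressed older data, replicated.

    compression_ratio is compressed size / original size (0.25 = 4x smaller).
    """
    stored_tb = hot_tb + cold_tb * compression_ratio
    raw_tb = stored_tb * replication
    return math.ceil(raw_tb / usable_tb_per_node)


# Placeholder inputs for illustration only.
print(project_size(current_tb=10, annual_growth=1.0, years=2))
print(nodes_needed(hot_tb=20, cold_tb=40, compression_ratio=0.25,
                   replication=3, usable_tb_per_node=6))
```

The point is only that storage, not CPU, sets the node count in this kind of estimate; a compute-bound workload would need a different model.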

For the local cluster we factored in hardware, warranty, the usual networking
gear for a cluster of that size, data center costs, and support manpower. We
also factored in a NAS and the bandwidth costs of replicating cluster data to
another data center for active replication.

For EMR costs we compared a reserved-instance cluster (nodes reserved for
3 years, with a hardware config similar to the above) at the above cluster size
against nodes on the fly. We factored in S3 costs to store the rotating data
calculated above and bandwidth costs for data coming in and going out. One
thing to note is that Amazon EMR charges come on top of the normal EC2 instance
costs. For example, if you run a job in EMR with 4 nodes and the job takes 1 hour
overall, then total EMR cost (excluding any data transfer costs) =
4 * 1 * (EMR rate per hour) + 4 * 1 * (EC2 rate per hour). Hopefully that makes sense.
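
The same arithmetic as a small function, in case it helps. The rates passed in below are placeholders chosen to keep the numbers round; they are not Amazon's prices.

```python
def emr_job_cost(nodes: int, hours: float,
                 emr_rate_per_hour: float, ec2_rate_per_hour: float) -> float:
    """Total job cost excluding data transfer.

    The EMR charge is paid per node-hour in addition to the EC2 charge.
    """
    return nodes * hours * emr_rate_per_hour + nodes * hours * ec2_rate_per_hour


# 4 nodes for 1 hour, with placeholder rates (not real pricing).
print(emr_job_cost(nodes=4, hours=1, emr_rate_per_hour=0.25, ec2_rate_per_hour=1.0))
```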

I am sure I am missing a few things above, but that's the gist of it.

- Sudhir

  





Re: Hadoop/Elastic MR on AWS

2010-12-27 Thread Sudhir Vallamkondu
We recently crossed this bridge and here are some insights. We did an
extensive study comparing costs and benchmarking local vs EMR for our
current needs and future trends.

- The scalability you get with EMR is unmatched, although you need to look at
your requirements and decide whether this is something you need.

- When using EMR it's cheaper to use reserved instances than nodes on the fly.
You can always add more nodes when required. I suggest looking at your
current computing needs, reserving instances for a year or two, using
these to run EMR, and adding nodes at peak needs. In your cost estimation you
will need to factor in the data transfer time/costs unless you are dealing
with public datasets on S3.

- EMR fared similarly to the local cluster on CPU benchmarks (we used MRBench
to benchmark map/reduce); however, IO benchmarks were slow on EMR (we used the
DFSIO benchmark). For IO-intensive jobs you will need to add more nodes to
compensate for this.

- Compared to a local cluster, you will need to factor in the time it takes
for the EMR cluster to set up when starting a job: things like data transfer
time, cluster replication time, etc.

- The EMR API is very flexible; however, you will need to build a custom
interface on top of it to suit your job management and monitoring needs.

- EMR bootstrap actions can satisfy most of your native lib needs, so no
drawbacks there.
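
To make the reserved-versus-on-the-fly point concrete, here is a rough sketch comparing a reserved baseline plus on-demand peak nodes against running everything on demand. It assumes the reserved rate is an effective hourly rate with any upfront fee already amortized in; all names and rates are placeholders, not real pricing.

```python
def mixed_cost(baseline_nodes: int, peak_extra_nodes: int, peak_hours: float,
               hours_in_period: float, reserved_rate: float,
               on_demand_rate: float) -> float:
    """Reserved baseline running all period, plus on-demand nodes during peaks."""
    reserved = baseline_nodes * hours_in_period * reserved_rate
    burst = peak_extra_nodes * peak_hours * on_demand_rate
    return reserved + burst


def all_on_demand_cost(baseline_nodes: int, peak_extra_nodes: int,
                       peak_hours: float, hours_in_period: float,
                       on_demand_rate: float) -> float:
    """Same workload with every node billed at the on-demand rate."""
    return (baseline_nodes * hours_in_period
            + peak_extra_nodes * peak_hours) * on_demand_rate


# Placeholder workload: 10 baseline nodes, 5 extra nodes for 100 peak hours,
# over a 720-hour month.
print(mixed_cost(10, 5, 100, 720, reserved_rate=0.5, on_demand_rate=1.0))
print(all_on_demand_cost(10, 5, 100, 720, on_demand_rate=1.0))
```

The comparison only favors reserving when the baseline nodes are actually busy most of the period; an idle reserved node still costs money.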


-- Sudhir




iCrossing Privileged and Confidential Information
This email message is for the sole use of the intended recipient(s) and may 
contain confidential and privileged information of iCrossing. Any unauthorized 
review, use, disclosure or distribution is prohibited. If you are not the 
intended recipient, please contact the sender by reply email and destroy all 
copies of the original message.




Re: Hadoop/Elastic MR on AWS

2010-12-27 Thread Dave Viner
Hi Sudhir,

Can you publish your findings around pricing, and how you calculated the
various aspects?

This is great information.

Thanks
Dave Viner







Re: Hadoop/Elastic MR on AWS

2010-12-27 Thread James Seigel
Thank you for sharing.

Sent from my mobile. Please excuse the typos.





Re: Hadoop/Elastic MR on AWS

2010-12-24 Thread Otis Gospodnetic
Hello Amandeep,



- Original Message 
 From: Amandeep Khurana ama...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Fri, December 10, 2010 1:14:45 AM
 Subject: Re: Hadoop/Elastic MR on AWS
 
 Mark,
 
 Using EMR makes it very easy to start a cluster and add/reduce  capacity as
 and when required. There are certain optimizations that make EMR  an
 attractive choice as compared to building your own cluster out. Using  EMR


Could you please point out what optimizations you are referring to?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/

 also ensures you are using a production quality, stable system backed by  the
 EMR engineers. You can always use bootstrap actions to put your own  tweaked
 version of Hadoop in there if you want to do that.
 
 Also, you  don't have to tear down your cluster after every job. You can set
 the alive  option when you start your cluster and it will stay there even
 after your  Hadoop job completes.
 
 If you face any issues with EMR, send me a mail  offline and I'll be happy to
 help.
 
 -Amandeep
 
 
 On Thu, Dec 9,  2010 at 9:47 PM, Mark static.void@gmail.com  wrote:
 
  Does anyone have any thoughts/experiences on running Hadoop  in AWS? What
  are some pros/cons?
 
  Are there any good  AMI's out there for this?
 
  Thanks for any advice.
 
 


Re: Hadoop/Elastic MR on AWS

2010-12-24 Thread Ted Dunning
EMR instances are started near each other.  This increases the bandwidth
between nodes.

There may also be some enhancements in terms of access to the SAN that
supports EBS.

On Fri, Dec 24, 2010 at 4:41 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 - Original Message 
  From: Amandeep Khurana ama...@gmail.com
  To: common-user@hadoop.apache.org
  Sent: Fri, December 10, 2010 1:14:45 AM
  Subject: Re: Hadoop/Elastic MR on AWS
 
  Mark,
 
  Using EMR makes it very easy to start a cluster and add/reduce  capacity
 as
  and when required. There are certain optimizations that make EMR  an
  attractive choice as compared to building your own cluster out. Using
  EMR


 Could you please point out what optimizations you are referring to?



Re: Hadoop/Elastic MR on AWS

2010-12-15 Thread Steve Loughran

On 10/12/10 06:14, Amandeep Khurana wrote:

Mark,

Using EMR makes it very easy to start a cluster and add/reduce capacity as
and when required. There are certain optimizations that make EMR an
attractive choice as compared to building your own cluster out. Using EMR
also ensures you are using a production quality, stable system backed by the
EMR engineers. You can always use bootstrap actions to put your own tweaked
version of Hadoop in there if you want to do that.

Also, you don't have to tear down your cluster after every job. You can set
the alive option when you start your cluster and it will stay there even
after your Hadoop job completes.

If you face any issues with EMR, send me a mail offline and I'll be happy to
help.



How different is your distro from the apache version?


Re: Hadoop/Elastic MR on AWS

2010-12-15 Thread Steve Loughran

On 09/12/10 18:57, Aaron Eng wrote:

Pros:
- Easier to build out and tear down clusters vs. using physical machines in
a lab
- Easier to scale up and scale down a cluster as needed

Cons:
- Reliability.  In my experience I've had machines die, had machines fail to
start up, had network outages between Amazon instances, etc.  These problems
have occurred at a far more significant rate than any physical lab I have
ever administered.
- Money. You get charged for problems with their system.  Need to add
storage space to a node?  That means renting space from EBS which you then
need to actually spend time formatting to ext3 so you can use it with
Hadoop.  So every time you want to use storage, you're paying Amazon to
format it because you can't tell EBS that you want an ext3 volume.
- Visibility.  Amazon loves to report that all their services are working
properly on their website, meanwhile, the reality is that they only report
issues if they are extremely major.  Just yesterday they reported increased
latency on their us-east-1 region.  In reality, increased latency means
50% of my Amazon API calls were timing out, I could not create new
instances and for about 2 hours I could not destroy the instances I had
already spun up.  Hows that for ya?  Paying them for machines that they
won't let me terminate...



that's the harsh reality of all VMs. you need to monitor and stamp on 
things that misbehave. The nice thing is: it's easy to do this, just get 
HTTP status pages and kill any VM


This is not a fault of EC2: any VM infra has this feature. You can't 
control where your VMs come up, you are penalised by other CPU-heavy 
machines on the same server, and Amazon throttles the smaller machines a bit.


But you
 -don't pay for cluster time you don't need
 -don't pay for ingress/egress for data you generate in the vendor's 
infrastructure (just storage)

 -can be very agile with cluster size.

I have a talk on this topic for the curious, discussing a UI that is a 
bit more agile, but even there we deploy agents to every node to keep an 
eye on the state of the cluster.


http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
http://blip.tv/file/3809976

Hadoop is designed to work well in a large-scale static cluster: fixed 
machines, with the reactions of clients to server failure (spin) 
and those of servers (blacklist clients) being the right ones to leave 
ops in control. In a virtual world you want the clients to see (somehow) 
if the master nodes have moved, and you want the servers to kill the 
misbehaving VMs to save money and then create new ones.
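
A minimal sketch of the "get HTTP status pages and kill what misbehaves" loop, assuming each node exposes some status URL. The node names, URLs and the idea of treating anything other than HTTP 200 as misbehaving are assumptions for illustration; actually terminating and replacing a VM is left to whatever cloud API is in use.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # unreachable, refused, or timed out


def find_misbehaving(status_urls: Dict[str, str],
                     fetch: Callable[[str], int] = http_status) -> List[str]:
    """Return the nodes whose status page did not answer with HTTP 200."""
    return [node for node, url in status_urls.items() if fetch(url) != 200]


# Illustration with a stand-in fetcher, so no network access is needed.
fake_results = {"http://node-a.example/status": 200,
                "http://node-b.example/status": 0}
nodes = {"node-a": "http://node-a.example/status",
         "node-b": "http://node-b.example/status"}
print(find_misbehaving(nodes, fetch=lambda url: fake_results[url]))
```

The fetcher is passed in so the sweep can be tested offline; in practice one failed probe is weak evidence, so a real monitor would probably require several consecutive failures before killing a VM.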


-Steve


Hadoop/Elastic MR on AWS

2010-12-09 Thread Mark
Does anyone have any thoughts/experiences on running Hadoop in AWS? What 
are some pros/cons?


Are there any good AMI's out there for this?

Thanks for any advice.


Re: Hadoop/Elastic MR on AWS

2010-12-09 Thread Mark Kerzner
Mark,

if nothing special is required, EMR will do fine, and you don't have to
build your cluster or shut it down, or worry about the underlying
AMI.

If you want your own clusters, Cloudera's distribution worked very well for
me.

Mark :)

On Thu, Dec 9, 2010 at 10:17 AM, Mark static.void@gmail.com wrote:

 Does anyone have any thoughts/experiences on running Hadoop in AWS? What
 are some pros/cons?

 Are there any good AMI's out there for this?

 Thanks for any advice.



Re: Hadoop/Elastic MR on AWS

2010-12-09 Thread Kiss Tibor
On Thu, Dec 9, 2010 at 5:17 PM, Mark static.void@gmail.com wrote:

 Does anyone have any thoughts/experiences on running Hadoop in AWS? What
 are some pros/cons?

EMR is a possibility. If you would like to try some MR jobs, it's OK, but
if you want to reuse the started instances it is better to have your own setup.
Especially for small jobs it is inefficient to keep starting and stopping new
instances; that's why I am not using EMR.

Cons:
The network connection between standard instances is not that fast, which in
some cases can reduce overall performance. You cannot guarantee rack locality:
your instances are picked randomly from diverse racks, which further aggravates
the network bandwidth problem.

Pros:
You can easily choose the size of your cluster.




 Are there any good AMI's out there for this?

I am using a Whirr-based setup of the Cloudera distribution. Cluster creation
always starts from a clean Amazon Linux AMI (or you may select another
one), an image that is not tied to Hadoop at all. So you don't need any special
AMI.



 Thanks for any advice.



Re: Hadoop/Elastic MR on AWS

2010-12-09 Thread Aaron Eng
Pros:
- Easier to build out and tear down clusters vs. using physical machines in
a lab
- Easier to scale up and scale down a cluster as needed

Cons:
- Reliability.  In my experience I've had machines die, had machines fail to
start up, had network outages between Amazon instances, etc.  These problems
have occurred at a far more significant rate than any physical lab I have
ever administered.
- Money. You get charged for problems with their system.  Need to add
storage space to a node?  That means renting space from EBS which you then
need to actually spend time formatting to ext3 so you can use it with
Hadoop.  So every time you want to use storage, you're paying Amazon to
format it because you can't tell EBS that you want an ext3 volume.
- Visibility.  Amazon loves to report that all their services are working
properly on their website, meanwhile, the reality is that they only report
issues if they are extremely major.  Just yesterday they reported increased
latency on their us-east-1 region.  In reality, increased latency means
50% of my Amazon API calls were timing out, I could not create new
instances and for about 2 hours I could not destroy the instances I had
already spun up.  Hows that for ya?  Paying them for machines that they
won't let me terminate...


This applies to both EMR and clusters you'd create yourself in EC2.  So if
you're willing to put up with not having much control over or insight into
the environment you're using, Amazon may be a good bet.  But don't expect it
to be all rainbows and daisies, you will run into problems at various points
which you did not cause and can not correct yourself, you'll have to wait
for Amazon to get their environment functioning.

On Thu, Dec 9, 2010 at 8:17 AM, Mark static.void@gmail.com wrote:

 Does anyone have any thoughts/experiences on running Hadoop in AWS? What
 are some pros/cons?

 Are there any good AMI's out there for this?

 Thanks for any advice.



Re: Hadoop/Elastic MR on AWS

2010-12-09 Thread Mark Kerzner
Actually,

 I had all these problems (like clusters failing to start) but learned to
live with them. As Aaron points out, people don't have to accept inferior
stuff, or at least should know about it.

Mark

 



Re: Hadoop/Elastic MR on AWS

2010-12-09 Thread Amandeep Khurana
Mark,

Using EMR makes it very easy to start a cluster and add/reduce capacity as
and when required. There are certain optimizations that make EMR an
attractive choice as compared to building your own cluster out. Using EMR
also ensures you are using a production quality, stable system backed by the
EMR engineers. You can always use bootstrap actions to put your own tweaked
version of Hadoop in there if you want to do that.

Also, you don't have to tear down your cluster after every job. You can set
the alive option when you start your cluster and it will stay there even
after your Hadoop job completes.

If you face any issues with EMR, send me a mail offline and I'll be happy to
help.

-Amandeep


On Thu, Dec 9, 2010 at 9:47 PM, Mark static.void@gmail.com wrote:

 Does anyone have any thoughts/experiences on running Hadoop in AWS? What
 are some pros/cons?

 Are there any good AMI's out there for this?

 Thanks for any advice.