Re: Hadoop/Elastic MR on AWS
> Are there any independent sites that collect cloud uptime numbers?

Not that I know of. If you look at the full thread, people have raised quite a few pros and cons. We are evaluating the AWS CloudWatch API to see how we can leverage it to monitor EMR. EMR is offered in many of their regions, and since we are planning on using S3 as the raw data store, if one region is experiencing problems we can always look into killing the job and restarting it in another region. Just a thought.

http://lucene.472066.n3.nabble.com/Hadoop-Elastic-MR-on-AWS-td2058471.html

On 12/28/10 8:01 PM, common-user-digest-h...@hadoop.apache.org wrote:

From: Lance Norskog goks...@gmail.com
Date: Tue, 28 Dec 2010 18:50:14 -0800
To: common-user@hadoop.apache.org
Subject: Re: Hadoop/Elastic MR on AWS

> Cloud providers have more uptime problems than dedicated servers. And it is impossible to benchmark: virtual server implementations do not apply quotas to I/O. I've seen the same 'instance size' have 5x deltas in disk bandwidth from one day to the next.
>
> Are there any independent sites that collect cloud uptime numbers?
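The region-fallback idea in this message could be sketched roughly as follows. This is only an illustration under stated assumptions: `pick_region`, the `is_healthy` callback, and the sample status table are hypothetical names, not part of any AWS library. In practice the health check might be backed by CloudWatch metrics or your own probes, and the job itself would be killed and restarted through the EMR API.

```python
from typing import Callable, Iterable, Optional


def pick_region(
    preferred: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in preference order, that passes the health check."""
    for region in preferred:
        if is_healthy(region):
            return region
    return None  # no region looks usable right now


# Hypothetical health data; a real check would query monitoring instead.
status = {"us-east-1": False, "us-west-1": True, "eu-west-1": True}
chosen = pick_region(["us-east-1", "us-west-1", "eu-west-1"], lambda r: status[r])
print(chosen)
```

One thing to weigh with this approach: an S3 bucket lives in a single region, so a job restarted elsewhere would read its input across regions, which adds transfer time and cost.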
Re: Hadoop/Elastic MR on AWS
Unfortunately I can't publish the exact numbers; however, here are the various things we considered.

First off, our data trends. We gathered our current data size and plotted a growth trend for the next few years. We then settled on an archival strategy to understand how much data needs to be on the cluster on a rotating basis. We crunch our data often (meaning as we get it), so computing power is not an issue; the cluster size was driven mainly by the amount of data that needs to be readily available and by our replication strategy. We factored in compression of older rotating data. Once we had these numbers we could decide on our cluster size and the type of hardware needed.

For the local cluster we factored in hardware, warranty, the usual networking gear for a cluster that size, data center costs, and support manpower. We also factored in a NAS and the bandwidth costs of replicating cluster data to another data center for active replication.

For EMR we compared a reserved-instance cluster (nodes reserved for 3 years, with a hardware config similar to the above) at the above cluster size against nodes started on the fly. We factored in S3 costs to store the rotating data calculated above, and bandwidth costs for data coming in and going out. One thing to note is that Amazon EMR charges come on top of the normal EC2 instance charges. For example, if you run a job in EMR with 4 nodes and the job takes 1 hour overall, then the total EMR cost (excluding any data transfer costs) = 4 * 1 * (EMR price per hour) + 4 * 1 * (EC2 price per hour).

Hopefully that makes sense. I am sure I am missing a few things above, but that's the gist of it.

- Sudhir

On 12/27/10 9:22 PM, common-user-digest-h...@hadoop.apache.org wrote:

From: Dave Viner davevi...@gmail.com
Date: Mon, 27 Dec 2010 10:23:37 -0800
To: common-user@hadoop.apache.org
Subject: Re: Hadoop/Elastic MR on AWS

> Hi Sudhir,
>
> Can you publish your findings around pricing, and how you calculated the various aspects? This is great information.
>
> Thanks
> Dave Viner
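The sizing and cost arithmetic described in this message can be written out as a small sketch. The function names, the compression ratio, the per-node capacity, and the hourly rates below are all placeholders chosen for illustration; they are not figures from the study and not real AWS prices.

```python
import math


def nodes_needed(raw_tb, compression_ratio, replication, usable_tb_per_node):
    """Estimate the nodes required to hold the rotating data set on the cluster."""
    stored_tb = raw_tb * compression_ratio * replication
    return math.ceil(stored_tb / usable_tb_per_node)


def emr_job_cost(nodes, hours, emr_rate, ec2_rate):
    """EMR fee plus EC2 fee per node-hour, excluding data transfer costs."""
    return nodes * hours * emr_rate + nodes * hours * ec2_rate


# The 4-node, 1-hour example from the message, with made-up hourly rates.
job_cost = emr_job_cost(nodes=4, hours=1, emr_rate=0.25, ec2_rate=0.5)

# 20 TB of rotating data, compressed to half its size, replicated 3 times,
# on nodes with 4 TB of usable disk each (all hypothetical values).
cluster_size = nodes_needed(
    raw_tb=20, compression_ratio=0.5, replication=3, usable_tb_per_node=4
)
print(job_cost, cluster_size)
```

Keeping the EMR and EC2 terms separate makes it easy to see how much of the bill is the EMR surcharge when comparing against a self-managed EC2 cluster.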
Re: Hadoop/Elastic MR on AWS
We recently crossed this bridge and here are some insights. We did an extensive study comparing costs and benchmarking local vs. EMR for our current needs and future trend.

- The scalability you get with EMR is unmatched, although you need to look at your requirements and decide whether this is something you need.
- When using EMR it's cheaper to use reserved instances than nodes on the fly. You can always add more nodes when required. I suggest looking at your current computing needs, reserving instances for a year or two, using these to run EMR, and adding nodes at peak needs. In your cost estimation you will need to factor in data transfer time/costs unless you are dealing with public datasets on S3.
- EMR fared similarly to the local cluster on CPU benchmarks (we used MRBench to benchmark map/reduce); however, IO benchmarks were slow on EMR (we used the DFSIO benchmark). For IO-intensive jobs you will need to add more nodes to compensate for this.
- When comparing with a local cluster, you will need to factor in the time it takes for the EMR cluster to set up when starting a job: things like data transfer time, cluster replication time, etc.
- The EMR API is very flexible; however, you will need to build a custom interface on top of it to suit your job management and monitoring needs.
- EMR bootstrap actions can satisfy most of your native lib needs, so no drawbacks there.

-- Sudhir

iCrossing Privileged and Confidential Information
This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information of iCrossing. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
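To make the reserved-versus-on-the-fly comparison concrete, here is a minimal sketch. It assumes a pricing model with a one-time reservation fee per node plus a discounted hourly rate; the function names and all dollar amounts are hypothetical and should be replaced with current prices before drawing conclusions.

```python
def on_demand_cost(nodes, hours, hourly_rate):
    """Cost of running nodes started on the fly."""
    return nodes * hours * hourly_rate


def reserved_cost(nodes, hours, upfront_fee, hourly_rate):
    """Cost of reserved nodes: one-time fee per node plus discounted hours."""
    return nodes * (upfront_fee + hours * hourly_rate)


def break_even_hours(upfront_fee, reserved_rate, on_demand_rate):
    """Hours of use per node after which reserving becomes cheaper."""
    return upfront_fee / (on_demand_rate - reserved_rate)


# Hypothetical prices: $1000 upfront, $0.25/h reserved, $0.50/h on demand.
hours_per_year = 8760
print(break_even_hours(1000, 0.25, 0.5))
print(reserved_cost(10, hours_per_year, 1000, 0.25))
print(on_demand_cost(10, hours_per_year, 0.5))
```

A baseline cluster that runs most of the year sits well past the break-even point, which is the case for reserving it; short peaks stay below it, which is the case for adding those nodes on the fly.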
Re: Hadoop/Elastic MR on AWS
Hi Sudhir,

Can you publish your findings around pricing, and how you calculated the various aspects? This is great information.

Thanks
Dave Viner

On Mon, Dec 27, 2010 at 10:17 AM, Sudhir Vallamkondu sudhir.vallamko...@icrossing.com wrote:

> We recently crossed this bridge and here are some insights. We did an extensive study comparing costs and benchmarking local vs EMR for our current needs and future trend.
Re: Hadoop/Elastic MR on AWS
Thank you for sharing.

Sent from my mobile. Please excuse the typos.

On 2010-12-27, at 11:18 AM, Sudhir Vallamkondu sudhir.vallamko...@icrossing.com wrote:

> We recently crossed this bridge and here are some insights. We did an extensive study comparing costs and benchmarking local vs EMR for our current needs and future trend.
Re: Hadoop/Elastic MR on AWS
Hello Amandeep,

----- Original Message -----
From: Amandeep Khurana ama...@gmail.com
To: common-user@hadoop.apache.org
Sent: Fri, December 10, 2010 1:14:45 AM
Subject: Re: Hadoop/Elastic MR on AWS

> Mark,
>
> Using EMR makes it very easy to start a cluster and add/reduce capacity as and when required. There are certain optimizations that make EMR an attractive choice as compared to building your own cluster out.

Could you please point out what optimizations you are referring to?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/
Re: Hadoop/Elastic MR on AWS
EMR instances are started near each other. This increases the bandwidth between nodes. There may also be some enhancements in terms of access to the SAN that supports EBS.

On Fri, Dec 24, 2010 at 4:41 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

> > There are certain optimizations that make EMR an attractive choice as compared to building your own cluster out.
>
> Could you please point out what optimizations you are referring to?
Re: Hadoop/Elastic MR on AWS
On 10/12/10 06:14, Amandeep Khurana wrote:

> Mark,
>
> Using EMR makes it very easy to start a cluster and add/reduce capacity as and when required. There are certain optimizations that make EMR an attractive choice as compared to building your own cluster out. Using EMR also ensures you are using a production quality, stable system backed by the EMR engineers. You can always use bootstrap actions to put your own tweaked version of Hadoop in there if you want to do that.
>
> Also, you don't have to tear down your cluster after every job. You can set the alive option when you start your cluster and it will stay there even after your Hadoop job completes.
>
> If you face any issues with EMR, send me a mail offline and I'll be happy to help.

How different is your distro from the Apache version?
Re: Hadoop/Elastic MR on AWS
On 09/12/10 18:57, Aaron Eng wrote:

> Pros:
> - Easier to build out and tear down clusters vs. using physical machines in a lab
> - Easier to scale up and scale down a cluster as needed
>
> Cons:
> - Reliability. In my experience I've had machines die, had machines fail to start up, had network outages between Amazon instances, etc. These problems have occurred at a far more significant rate than any physical lab I have ever administered.
> - Money. You get charged for problems with their system. Need to add storage space to a node? That means renting space from EBS which you then need to actually spend time formatting to ext3 so you can use it with Hadoop. So every time you want to use storage, you're paying Amazon to format it because you can't tell EBS that you want an ext3 volume.
> - Visibility. Amazon loves to report that all their services are working properly on their website, meanwhile, the reality is that they only report issues if they are extremely major. Just yesterday they reported increased latency on their us-east-1 region. In reality, increased latency means 50% of my Amazon API calls were timing out, I could not create new instances and for about 2 hours I could not destroy the instances I had already spun up. How's that for ya? Paying them for machines that they won't let me terminate...

That's the harsh reality of all VMs: you need to monitor and stamp on things that misbehave. The nice thing is that it's easy to do this: just get HTTP status pages and kill any VM that misbehaves.

This is not a fault of EC2: any VM infra has this feature. You can't control where your VMs come up, you are penalised by other CPU-heavy machines on the same server, and Amazon throttles the smaller machines a bit. But you

- don't pay for cluster time you don't need
- don't pay for ingress/egress for data you generate in the vendor's infrastructure (just storage)
- can be very agile with cluster size.

I have a talk on this topic for the curious, discussing a UI that is a bit more agile, but even there we deploy agents to every node to keep an eye on the state of the cluster.

http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
http://blip.tv/file/3809976

Hadoop is designed to work well in a large-scale static cluster: fixed machines, with the reactions of clients to server failure (spin) and those of servers (blacklist clients) being the right ones to leave ops in control. In a virtual world you want the clients to see (somehow) if the master nodes have moved, and you want the servers to kill the misbehaving VMs to save money and then create new ones.

-Steve
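The "get HTTP status pages and kill any VM that fails" approach might look something like the sketch below. It is an assumption-laden illustration, not the tooling from the talk: the host names are invented, port 50060 is assumed to be the TaskTracker web UI, and actually terminating a VM is left to whatever provider API you use.

```python
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if the status page answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def find_unhealthy(status_urls, probe=http_ok):
    """Return the names of nodes whose status page probe fails."""
    return [node for node, url in status_urls.items() if not probe(url)]


# Hypothetical nodes; the probe is swapped for a fake so nothing is contacted.
nodes = {
    "worker-1": "http://worker-1.example.internal:50060/",
    "worker-2": "http://worker-2.example.internal:50060/",
}
to_kill = find_unhealthy(nodes, probe=lambda url: "worker-1" in url)
print(to_kill)
```

Passing the probe in as a parameter keeps the decision logic testable without a live cluster; a production loop would probably also require several consecutive failures before terminating anything.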
Hadoop/Elastic MR on AWS
Does anyone have any thoughts/experiences on running Hadoop in AWS? What are some pros/cons? Are there any good AMIs out there for this? Thanks for any advice.
Re: Hadoop/Elastic MR on AWS
Mark,

If nothing special is required, EMR will do fine: you don't have to build your cluster or shut it down, and you don't have to worry about the underlying AMI. If you want your own clusters, Cloudera's distribution worked very well for me.

Mark :)

On Thu, Dec 9, 2010 at 10:17 AM, Mark static.void@gmail.com wrote:

> Does anyone have any thoughts/experiences on running Hadoop in AWS? What are some pros/cons?
>
> Are there any good AMIs out there for this?
>
> Thanks for any advice.
Re: Hadoop/Elastic MR on AWS
On Thu, Dec 9, 2010 at 5:17 PM, Mark static.void@gmail.com wrote:

> Does anyone have any thoughts/experiences on running Hadoop in AWS? What are some pros/cons?

EMR is a possibility. If you just want to try some MR jobs, it's OK, but if you want to reuse the started instances it's better to have your own setup. Especially for small jobs, it is inefficient to start and stop new instances each time; that's why I am not using EMR.

Cons: The network bandwidth between standard instances is not that high, which in some cases can reduce overall performance. You cannot guarantee rack locality: your instances are picked at random from diverse racks, which further aggravates the network bandwidth problem.

Pros: You can easily choose the size of your cluster.

> Are there any good AMIs out there for this?

I am using a Whirr-based setup of the Cloudera distribution. Cluster creation always starts from a clean Amazon Linux AMI (or you may select another one), an image that is not tied to Hadoop at all. So you don't need any special AMI.

> Thanks for any advice.
Re: Hadoop/Elastic MR on AWS
Pros:
- Easier to build out and tear down clusters vs. using physical machines in a lab
- Easier to scale up and scale down a cluster as needed

Cons:
- Reliability. In my experience I've had machines die, had machines fail to start up, had network outages between Amazon instances, etc. These problems have occurred at a far more significant rate than in any physical lab I have ever administered.
- Money. You get charged for problems with their system. Need to add storage space to a node? That means renting space from EBS, which you then need to actually spend time formatting to ext3 so you can use it with Hadoop. So every time you want to use storage, you're paying Amazon to format it because you can't tell EBS that you want an ext3 volume.
- Visibility. Amazon loves to report on their website that all their services are working properly; meanwhile, the reality is that they only report issues if they are extremely major. Just yesterday they reported increased latency in their us-east-1 region. In reality, increased latency meant 50% of my Amazon API calls were timing out, I could not create new instances, and for about 2 hours I could not destroy the instances I had already spun up. How's that for ya? Paying them for machines that they won't let me terminate...

This applies to both EMR and clusters you'd create yourself in EC2. So if you're willing to put up with not having much control over or insight into the environment you're using, Amazon may be a good bet. But don't expect it to be all rainbows and daisies: you will run into problems at various points which you did not cause and cannot correct yourself, and you'll have to wait for Amazon to get their environment functioning.

On Thu, Dec 9, 2010 at 8:17 AM, Mark static.void@gmail.com wrote:

> Does anyone have any thoughts/experiences on running Hadoop in AWS? What are some pros/cons?
>
> Are there any good AMIs out there for this?
>
> Thanks for any advice.
Re: Hadoop/Elastic MR on AWS
Actually, I had all these problems (like clusters failing to start) but learned to live with them. As Aaron points out, people don't have to accept inferior stuff, or at least should know about it.

Mark

On Thu, Dec 9, 2010 at 12:57 PM, Aaron Eng a...@maprtech.com wrote:

> Cons:
> - Reliability. In my experience I've had machines die, had machines fail to start up, had network outages between Amazon instances, etc. These problems have occurred at a far more significant rate than any physical lab I have ever administered.
Re: Hadoop/Elastic MR on AWS
Mark,

Using EMR makes it very easy to start a cluster and add/reduce capacity as and when required. There are certain optimizations that make EMR an attractive choice as compared to building your own cluster out. Using EMR also ensures you are using a production quality, stable system backed by the EMR engineers. You can always use bootstrap actions to put your own tweaked version of Hadoop in there if you want to do that.

Also, you don't have to tear down your cluster after every job. You can set the alive option when you start your cluster and it will stay there even after your Hadoop job completes.

If you face any issues with EMR, send me a mail offline and I'll be happy to help.

-Amandeep

On Thu, Dec 9, 2010 at 9:47 PM, Mark static.void@gmail.com wrote:

> Does anyone have any thoughts/experiences on running Hadoop in AWS? What are some pros/cons?
>
> Are there any good AMIs out there for this?
>
> Thanks for any advice.
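For readers who want to see what the "alive" setting and a bootstrap action look like in a request, here is a rough sketch. It assumes the field names of the present-day EMR RunJobFlow API as exposed through boto3's `run_job_flow`, which postdates this thread and is not the tool referred to above. The helper name, instance type, cluster name, and S3 path are placeholders, and the request is deliberately partial (a real call needs further fields such as roles). Nothing is sent to AWS.

```python
def build_cluster_request(name, node_count, bootstrap_script=None):
    """Build a partial RunJobFlow-style request for a long-lived cluster."""
    request = {
        "Name": name,
        "Instances": {
            "MasterInstanceType": "m1.large",  # placeholder instance type
            "SlaveInstanceType": "m1.large",
            "InstanceCount": node_count,
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }
    if bootstrap_script:
        # A bootstrap action runs a script on every node at startup.
        request["BootstrapActions"] = [
            {
                "Name": "custom-setup",
                "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []},
            }
        ]
    return request


# Hypothetical script location; substitute your own bucket and key.
req = build_cluster_request(
    "nightly-etl", 4, "s3://my-bucket/bootstrap/install-libs.sh"
)
print(req["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

A cluster kept alive this way keeps accruing charges until it is terminated explicitly, so the setting suits back-to-back jobs better than occasional ones.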