Re: Hadoop/Elastic MR on AWS

Sudhir Vallamkondu Wed, 29 Dec 2010 08:32:35 -0800

> Are there any independent sites that collect cloud uptime numbers?
Not that I know of.


If you look at the full post content, people have raised quite a few pros and
cons. We are analyzing the AWS CloudWatch API to see how we can leverage it
to monitor EMR. EMR is offered in many of their regions, and since we are
planning on using S3 as the raw data store, if one region is experiencing
problems we can always look into killing the job and restarting it in another
region. Just a thought.
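
A minimal sketch of that failover idea, assuming an ordered list of preferred
regions and a health check we would have to write ourselves (for example on
top of CloudWatch). The function name, the region list and the health check
below are all hypothetical; nothing here calls a real AWS API.

```python
def pick_region(preferred_regions, is_healthy):
    """Return the first region that passes the health check, or None.

    preferred_regions: region names in order of preference (assumed list).
    is_healthy: caller-supplied callable taking a region name and returning
                True or False. Hypothetical; it could wrap CloudWatch metrics.
    """
    for region in preferred_regions:
        if is_healthy(region):
            return region
    return None


# Illustrative use with a stubbed health check that marks one region as down.
regions = ["us-east-1", "us-west-1", "eu-west-1"]
down = {"us-east-1"}
target = pick_region(regions, lambda r: r not in down)
```

The job would then be killed in the failing region and resubmitted in
`target`; how the S3 input is made reachable from the new region is left out
of this sketch.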

http://lucene.472066.n3.nabble.com/Hadoop-Elastic-MR-on-AWS-td2058471.html


On 12/28/10 8:01 PM, "common-user-digest-h...@hadoop.apache.org"
<common-user-digest-h...@hadoop.apache.org> wrote:

> From: Lance Norskog <goks...@gmail.com>
> Date: Tue, 28 Dec 2010 18:50:14 -0800
> To: <common-user@hadoop.apache.org>
> Subject: Re: Hadoop/Elastic MR on AWS
> 
> Cloud providers have more uptime problems than dedicated servers. And
> it is impossible to benchmark: virtual server implementations do not
> apply quotas to I/O. I've seen the same 'instance size' have 5x deltas
> in disk bandwidth from one day to the next.
> 
> Are there any independent sites that collect cloud uptime numbers?
> 
> On Tue, Dec 28, 2010 at 5:41 PM, Sudhir Vallamkondu
> <sudhir.vallamko...@icrossing.com> wrote:
>> Unfortunately I can't publish the exact numbers; however, here are the
>> various things we considered.
>> 
>> First off, our data trends. We gathered our current data size and plotted a
>> future growth trend for the next few years. We then settled on an archival
>> strategy to understand how much data needs to be on the cluster on a
>> rotating basis. We crunch our data often (meaning as we get it), so
>> computing power is not an issue; the cluster size was driven mainly by
>> the amount of data that needs to be readily available and by our replication
>> strategy. We factored in compression of the older rotating data.
>> 
>> Once we had the above numbers we could decide on our cluster infrastructure
>> size and type of hardware needed.
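
The sizing reasoning above can be sketched roughly as follows. This is only
an illustration under assumed inputs: the function name, the headroom
parameter and every number in the example are hypothetical, not figures from
our study.

```python
import math


def nodes_needed(raw_tb, replication, compression_ratio,
                 usable_tb_per_node, headroom=0.25):
    """Estimate node count when storage, not CPU, drives cluster size.

    raw_tb: uncompressed data that must stay on the cluster (TB).
    replication: HDFS replication factor.
    compression_ratio: compressed size / raw size (0.5 means half the size).
    usable_tb_per_node: disk available to HDFS on one node (TB).
    headroom: fraction of capacity kept free (assumed, for temp/shuffle data).
    """
    stored_tb = raw_tb * compression_ratio * replication
    required_tb = stored_tb / (1.0 - headroom)
    return math.ceil(required_tb / usable_tb_per_node)


# Hypothetical example: 100 TB raw, 3x replication, 2:1 compression,
# 8 TB usable per node, 25% of capacity kept free.
estimate = nodes_needed(100, 3, 0.5, 8)
```

Feeding the projected data size for each future year into the same function
gives a growth curve for the cluster.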
>> 
>> For the local cluster we factored in hardware, warranty, the usual networking
>> gear for a cluster of that size, data center costs, and support manpower. We
>> also factored in a NAS and the bandwidth costs to replicate cluster data to
>> another data center for active replication.
>> 
>> For EMR costs we compared a reserved instance cluster (nodes reserved for
>> 3 years with a hardware config similar to the above) at the above cluster
>> size vs nodes on the fly. We factored in S3 costs to store the rotating data
>> calculated above and bandwidth costs for data coming in and going out. One
>> thing to note is that Amazon EMR charges are on top of the normal EC2
>> instance costs. For example, if you run a job in EMR with 4 nodes and the job
>> takes 1 hour overall, then total EMR cost (excluding any data transfer costs)
>> = 4 * 1 * {EMR cost/hour} + 4 * 1 * {EC2 cost/hour}. Hopefully that makes
>> sense.
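
The same arithmetic as a small sketch. The hourly rates used in the example
are placeholders, not actual AWS prices, and the function name is made up for
illustration.

```python
def emr_job_cost(nodes, hours, ec2_rate_per_hour, emr_rate_per_hour):
    """Total job cost excluding data transfer.

    Each node is charged the normal EC2 hourly rate plus the EMR hourly
    surcharge for every hour the job runs.
    """
    ec2_cost = nodes * hours * ec2_rate_per_hour
    emr_cost = nodes * hours * emr_rate_per_hour
    return ec2_cost + emr_cost


# 4 nodes for 1 hour, with placeholder rates (assumed, not real prices).
total = emr_job_cost(nodes=4, hours=1,
                     ec2_rate_per_hour=0.34, emr_rate_per_hour=0.06)
```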
>> 
>> I am sure I am missing a few things above, but that's the gist of it.
>> 
>> - Sudhir
>> 
>> 
>> 
>> 
>> 
>> 
>> On 12/27/10 9:22 PM, "common-user-digest-h...@hadoop.apache.org"
>> <common-user-digest-h...@hadoop.apache.org> wrote:
>> 
>>> From: Dave Viner <davevi...@gmail.com>
>>> Date: Mon, 27 Dec 2010 10:23:37 -0800
>>> To: <common-user@hadoop.apache.org>
>>> Subject: Re: Hadoop/Elastic MR on AWS
>>> 
>>> Hi Sudhir,
>>> 
>>> Can you publish your findings around pricing, and how you calculated the
>>> various aspects?
>>> 
>>> This is great information.
>>> 
>>> Thanks
>>> Dave Viner
>>> 
>>> 
>>> On Mon, Dec 27, 2010 at 10:17 AM, Sudhir Vallamkondu <
>>> sudhir.vallamko...@icrossing.com> wrote:
>>> 
>>>> We recently crossed this bridge and here are some insights. We did an
>>>> extensive study comparing costs and benchmarking local vs EMR for our
>>>> current needs and future trend.
>>>> 
>>>> - The scalability you get with EMR is unmatched, although you need to look
>>>> at your requirements and decide whether this is something you need.
>>>> 
>>>> - When using EMR it's cheaper to use reserved instances vs nodes on the fly.
>>>> You can always add more nodes when required. I suggest looking at your
>>>> current computing needs, reserving instances for a year or two, using
>>>> these to run EMR, and adding nodes at peak needs. In your cost estimation
>>>> you will need to factor in the data transfer time/costs unless you are
>>>> dealing with public datasets on S3.
>>>> 
>>>> - EMR fared similarly to the local cluster on CPU benchmarks (we used
>>>> MRBench to benchmark map/reduce); however, IO benchmarks were slow on EMR
>>>> (we used the DFSIO benchmark). For IO-intensive jobs you will need to add
>>>> more nodes to compensate for this.
>>>> 
>>>> - Compared to a local cluster, you will need to factor in the time it takes
>>>> for the EMR cluster to set up when starting a job: things like data
>>>> transfer time, cluster replication time, etc.
>>>> 
>>>> - The EMR API is very flexible; however, you will need to build a custom
>>>> interface on top of it to suit your job management and monitoring needs.
>>>> 
>>>> - EMR bootstrap actions can satisfy most of your native lib needs so no
>>>> drawbacks there.
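
One way to sanity-check the reserved vs on-the-fly point in the list above is
a break-even calculation. This sketch assumes a simple pricing model (a
one-time reservation fee plus a lower hourly rate); the function name and all
numbers are hypothetical, not actual AWS prices.

```python
def break_even_hours(upfront_fee, reserved_rate, on_demand_rate):
    """Hours of use per node after which reserving becomes cheaper.

    Assumed pricing model: a reserved node costs upfront_fee once plus
    reserved_rate per hour; an on-demand node costs on_demand_rate per hour.
    """
    if on_demand_rate <= reserved_rate:
        raise ValueError("reserving never pays off at these rates")
    return upfront_fee / (on_demand_rate - reserved_rate)


# Placeholder numbers: 1000 up front, 0.25/hour reserved, 0.50/hour on demand.
hours = break_even_hours(1000, 0.25, 0.50)
```

If the baseline workload keeps a node busy for more than `hours` over the
reservation term, reserving that node is the cheaper option; nodes needed
only at peak stay on demand.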
>>>> 
>>>> 
>>>> -- Sudhir
>>>> 
>>>> 
>>>> On 12/26/10 5:26 AM, "common-user-digest-h...@hadoop.apache.org"
>>>> <common-user-digest-h...@hadoop.apache.org> wrote:
>>>> 
>>>>> From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
>>>>> Date: Fri, 24 Dec 2010 04:41:46 -0800 (PST)
>>>>> To: <common-user@hadoop.apache.org>
>>>>> Subject: Re: Hadoop/Elastic MR on AWS
>>>>> 
>>>>> Hello Amandeep,
>>>>> 
>>>>> 
>>>>> 
>>>>> ----- Original Message ----
>>>>>> From: Amandeep Khurana <ama...@gmail.com>
>>>>>> To: common-user@hadoop.apache.org
>>>>>> Sent: Fri, December 10, 2010 1:14:45 AM
>>>>>> Subject: Re: Hadoop/Elastic MR on AWS
>>>>>> 
>>>>>> Mark,
>>>>>> 
>>>>>> Using EMR makes it very easy to start a cluster and add/reduce capacity as
>>>>>> and when required. There are certain optimizations that make EMR an
>>>>>> attractive choice as compared to building your own cluster out. Using EMR
>>>>> 
>>>>> 
>>>>> Could you please point out what optimizations you are referring to?
>>>>> 
>>>>> Thanks,
>>>>> Otis
>>>>> ----
>>>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
>>>>> Hadoop ecosystem search :: http://search-hadoop.com/
>>>>> 
>>>>>> also ensures you are using a production quality, stable system backed by the
>>>>>> EMR engineers. You can always use bootstrap actions to put your own tweaked
>>>>>> version of Hadoop in there if you want to do that.
>>>>>> 
>>>>>> Also, you don't have to tear down your cluster after every job. You can set
>>>>>> the alive option when you start your cluster and it will stay there even
>>>>>> after your Hadoop job completes.
>>>>>> 
>>>>>> If you face any issues with EMR, send me a mail offline and I'll be happy to
>>>>>> help.
>>>>>> 
>>>>>> -Amandeep
>>>>>> 
>>>>>> 
>>>>>> On Thu, Dec 9, 2010 at 9:47 PM, Mark <static.void....@gmail.com> wrote:
>>>>>> 
>>>>>>> Does anyone have any thoughts/experiences on running Hadoop in AWS? What
>>>>>>> are some pros/cons?
>>>>>>> 
>>>>>>> Are there any good AMIs out there for this?
>>>>>>> 
>>>>>>> Thanks for any advice.
>>>>>>> 
>>>>>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
