Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines
Sandeep,

How are you guys moving 100 TB into the AWS cloud? Are you using S3 or EBS? If you are using S3, it does not work like HDFS. Although data in S3 is replicated (I believe within an availability zone), it is not the same as HDFS replication. You lose Hadoop's data locality optimization when you use S3, which runs counter to MapReduce's paradigm of sending code to the data. Mind you, traffic in and out of S3 incurs costs as well (once you lose data locality optimization).

I hear that to get PBs worth of data into AWS, it is not uncommon to drive a truck with your data on some physical storage device (in fact, Amazon will help you do this).

Please update us; this is an interesting problem.

Thanks,

On Thu, May 31, 2012 at 2:41 PM, Sandeep Reddy P sandeepreddy.3...@gmail.com wrote:

Hi,

We are getting 100 TB of data; with a replication factor of 3, this comes to 300 TB. We are planning to use Hadoop with 65 nodes. We want to know which option will be better in terms of hardware: physical machines, or deploying Hadoop on EC2. Is there any document that supports the use of physical machines?

Hardware specs: 2 quad-core CPUs, 32 GB RAM, 12 x 1 TB hard drives, 10 Gb Ethernet switches; this costs $10k for each machine.

Is it cheaper to use EC2? Will there be any performance issues?

--
Thanks,
sandeep
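The figures in the quoted message can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check, assuming decimal units and that the $10k figure is the all-in price per node (the message does not say whether switches are included); the class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterPlan:
    """Inputs taken from the quoted message; names are illustrative."""
    data_tb: int = 100             # logical data set size
    replication: int = 3           # HDFS replication factor
    nodes: int = 65
    disks_per_node: int = 12
    disk_tb: int = 1
    cost_per_node_usd: int = 10_000  # assumption: all-in price per node

    @property
    def replicated_tb(self) -> int:
        # Space consumed once every block is stored `replication` times.
        return self.data_tb * self.replication

    @property
    def raw_capacity_tb(self) -> int:
        # Total disk capacity before any OS or temp-space overhead.
        return self.nodes * self.disks_per_node * self.disk_tb

    @property
    def hardware_cost_usd(self) -> int:
        return self.nodes * self.cost_per_node_usd

    @property
    def fill_ratio(self) -> float:
        # Fraction of raw disk the replicated data would occupy.
        return self.replicated_tb / self.raw_capacity_tb


plan = ClusterPlan()
print(f"replicated data : {plan.replicated_tb} TB")
print(f"raw capacity    : {plan.raw_capacity_tb} TB")
print(f"hardware cost   : {plan.hardware_cost_usd} USD")
print(f"fill ratio      : {plan.fill_ratio:.1%}")
```

Under these assumptions the replicated data would fill well under half of the raw disk, which leaves room for intermediate MapReduce output and growth.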
Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines
Correct me if I'm wrong, but the cost of storage alone for 300 TB on AWS will come to roughly 300,000 * 0.10 * 12 = 360,000 USD per annum (300,000 GB at 0.10 USD per GB per month).

We operate a cluster with 112 nodes offering 800+ TB of raw HDFS capacity, and the CAPEX was less than 700k USD. If you ask me, there is no comparison possible if you have the datacenter space to host your machines.

Do you really need 10 GbE? We're quite happy with 1 GbE with no over-subscription.

Mathias.
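The storage estimate above can be written out explicitly. This is a rough sketch that reads the estimate as 300 TB = 300,000 GB at 0.10 USD per GB per month; the rate is the figure used in the message, not a quoted AWS price, and the function names are hypothetical. The CAPEX comparison uses the "less than 700k USD" and "800+ TB" figures, so the per-TB result is an upper bound.

```python
GB_PER_TB = 1000  # decimal units, as in the estimate above


def annual_storage_cost_usd(stored_tb: float,
                            usd_per_gb_month: float = 0.10,
                            months: int = 12) -> float:
    """Storage-only cost; ignores requests, transfer, and compute."""
    return stored_tb * GB_PER_TB * usd_per_gb_month * months


def capex_per_raw_tb_usd(capex_usd: float, raw_tb: float) -> float:
    """One-off hardware cost per TB of raw disk; ignores power, hosting, staff."""
    return capex_usd / raw_tb


cloud_per_year = annual_storage_cost_usd(300)
own_per_tb = capex_per_raw_tb_usd(700_000, 800)

print(f"cloud storage for 300 TB: {cloud_per_year:,.0f} USD per year")
print(f"own cluster CAPEX       : at most {own_per_tb:,.0f} USD per raw TB")
```

The two numbers are not directly comparable (one is recurring, the other is a one-off that excludes running costs), but they show the order of magnitude behind the argument.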
Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines
Thanks for the reply, Mathias.

The actual data is 100 TB, so I think we need to host 100 TB on AWS. Is there replication in AWS as well? We are looking for a comparison of performance curves and issues between physical machines and AWS.

On Thu, May 31, 2012 at 2:50 PM, Mathias Herberts mathias.herbe...@gmail.com wrote:

Correct me if I'm wrong, but the cost of storage alone for 300 TB on AWS will come to roughly 300,000 * 0.10 * 12 = 360,000 USD per annum.

We operate a cluster with 112 nodes offering 800+ TB of raw HDFS capacity, and the CAPEX was less than 700k USD. If you ask me, there is no comparison possible if you have the datacenter space to host your machines.

Do you really need 10 GbE? We're quite happy with 1 GbE with no over-subscription.

Mathias.

--
Thanks,
sandeep
Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines
"Thanks for the reply, Mathias. The actual data is 100 TB, so I think we need to host 100 TB on AWS."

It's also worth noting that, besides storage costs, simply moving 100 TB to AWS is not a trivial task. Import/Export (http://aws.amazon.com/importexport/) has a limit of 16 TB, although it seems they might be flexible for larger volumes.

On Thu, May 31, 2012 at 3:01 PM, Sandeep Reddy P sandeepreddy.3...@gmail.com wrote:

Thanks for the reply, Mathias.

The actual data is 100 TB, so I think we need to host 100 TB on AWS. Is there replication in AWS as well? We are looking for a comparison of performance curves and issues between physical machines and AWS.

On Thu, May 31, 2012 at 2:50 PM, Mathias Herberts mathias.herbe...@gmail.com wrote:

Correct me if I'm wrong, but the cost of storage alone for 300 TB on AWS will come to roughly 300,000 * 0.10 * 12 = 360,000 USD per annum.

We operate a cluster with 112 nodes offering 800+ TB of raw HDFS capacity, and the CAPEX was less than 700k USD. If you ask me, there is no comparison possible if you have the datacenter space to host your machines.

Do you really need 10 GbE? We're quite happy with 1 GbE with no over-subscription.

Mathias.

--
Thanks,
sandeep
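To put "not a trivial task" into numbers, here is a rough sketch of the two obvious routes for 100 TB. It assumes the 16 TB Import/Export limit mentioned above applies per device, and the link speeds are hypothetical examples rather than anything stated in the thread; the function names are made up for illustration.

```python
import math


def devices_needed(data_tb: float, limit_tb: float = 16) -> int:
    """Number of shipped devices if each holds at most `limit_tb` (assumed per-device limit)."""
    return math.ceil(data_tb / limit_tb)


def transfer_days(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days to push `data_tb` over a link, at `efficiency` of line rate."""
    bits = data_tb * 1e12 * 8                       # decimal TB to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400


print(f"devices for 100 TB      : {devices_needed(100)}")
print(f"days at 1 Gb/s (ideal)  : {transfer_days(100, 1):.1f}")
print(f"days at 10 Gb/s (ideal) : {transfer_days(100, 10):.1f}")
```

Real transfers rarely sustain line rate, so passing an `efficiency` below 1.0 gives a more conservative estimate.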
Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines
We were actually in an Amazon vs. host-it-yourself debate with someone, which prompted us to do some calculations: http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/myth_busters_ops_editition_is

We calculated the cost for storage alone of 300 TB on EC2 as 585K a month! The cloud people hate hearing facts like this with staggering $ values. They also do not like hearing how a $35-a-month physical server at Joe's datacenter is much better than an equivalent cloud machine. http://blog.carlmercier.com/2012/01/05/ec2-is-basically-one-big-ripoff/

When you bring up these facts, the go-to move is to go buzzword, with phrases like "cost of system admin", "elastic", and "up-front initial costs".

I will say that Amazon's EMR service is pretty cool and there is something to it, but the cost of storage and good performance is off the scale for me.

On 5/31/12, Mathias Herberts mathias.herbe...@gmail.com wrote:

Correct me if I'm wrong, but the cost of storage alone for 300 TB on AWS will come to roughly 300,000 * 0.10 * 12 = 360,000 USD per annum.

We operate a cluster with 112 nodes offering 800+ TB of raw HDFS capacity, and the CAPEX was less than 700k USD. If you ask me, there is no comparison possible if you have the datacenter space to host your machines.

Do you really need 10 GbE? We're quite happy with 1 GbE with no over-subscription.

Mathias.
Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines
We once calculated the cost of using EC2 to train our machine learning model with the EM algorithm (assuming we did everything in one shot, which is almost impossible). The cost came to 10,000 US dollars per model. The hourly cost of an individual node seems cheap, but once it scales up (the hourly rate multiplied by the number of nodes and by the number of hours required for model training), the total is still quite shocking.

Shi
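The scaling described above is just a product of three factors. In the sketch below the node count, hours, and hourly rate are hypothetical values chosen only to show how a cheap-looking hourly price reaches a total of that size; none of them come from the message.

```python
def training_cost_usd(nodes: int, hours: float, usd_per_node_hour: float) -> float:
    """Total on-demand cost of one training run: nodes x hours x hourly rate."""
    return nodes * hours * usd_per_node_hour


# Hypothetical inputs, for illustration only.
example = training_cost_usd(nodes=100, hours=200, usd_per_node_hour=0.50)
print(f"one training run: {example:,.0f} USD")
```

Any retries or parameter sweeps multiply this again, which is why the "one shot" assumption matters.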