RE: Why Hadoop is slow in Cloud
> Even with the performance hit, there are still benefits to running Hadoop this way:
> - As you only consume and pay for the CPU time you use, if you are only running batch jobs, it's lower cost than having a Hadoop cluster that is under-used.
> - If your data is stored in the cloud infrastructure, then you need to data mine it in VMs, unless you want to take the time and money hit of moving it out, and have somewhere to store it.
> - If the infrastructure lets you, you can lock down the cluster so it is secure.
>
> Where a physical cluster is good is that it is a very low cost way of storing data, provided you can analyse it with Hadoop, and provided you can keep that cluster busy most of the time, either with Hadoop work or other scheduled work. If your cluster is idle for computation, you are still paying the capital and (reduced) electricity costs, so the cost of storage and what compute you do effectively increases.

Agreed, but this has little to do with Hadoop as middleware and more to do with the benefits of virtualized vs physical infrastructure. I agree that it is convenient to use HDFS as a DFS to keep your data local to your VMs, but you could choose other DFSs as well.

The major benefit of Hadoop is its data-locality principle, and this is what you give up when you move to the cloud. Regardless of whether you store your data within your VM or on a NAS, it *will* have to travel over a line. As soon as that happens you lose the benefit of data locality and are left with MapReduce as a way of doing parallel computing. And in that case you could use less restrictive software, like maybe PBS. You could even install HOD on your virtual cluster, if you'd like the possibility of MapReduce.

Adarsh, there are probably results around from more generic benchmark tools (Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should give you a better idea of the penalties of virtualization. (Our experience with a number of technologies on our OpenNebula cloud is, like Steve points out, that you mainly pay in disk I/O performance.)

I think a decision to go with either cloud or physical infrastructure should be based on the frequency, intensity and types of computation you expect in the short term (that should include operations dealing with data), and the way you think these parameters will develop in the mid to long term. Then compare the price of a physical cluster that meets those demands (make sure to include power and operations) with the investment you would otherwise need to make in cloud.
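The cost comparison described above (an owned cluster whose idle hours still cost money, versus cloud hours that are billed only when used but run slower) can be put into numbers with a small model. This is only a minimal sketch: the function names, the flat hourly pricing and the single `slowdown` factor are all assumptions made for illustration, not anything measured in this thread.

```python
def physical_cost_per_used_hour(capex, lifetime_hours, opex_per_hour, utilization):
    """Effective cost of one *useful* node-hour on an owned cluster.

    Idle hours still incur capital and operating costs, so the total
    hourly cost is spread over only the fraction of hours doing real work.
    (Hypothetical model: flat opex, linear depreciation.)
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hourly_total = capex / lifetime_hours + opex_per_hour
    return hourly_total / utilization


def cloud_cost_per_used_hour(price_per_hour, slowdown=1.0):
    """Effective cost of one node-hour of equivalent work in the cloud.

    slowdown > 1 models the virtualization penalty: work that needs one
    hour on hardware needs `slowdown` billed hours in a VM.
    """
    return price_per_hour * slowdown


def break_even_utilization(capex, lifetime_hours, opex_per_hour,
                           price_per_hour, slowdown=1.0):
    """Utilization above which the owned cluster is cheaper per useful hour."""
    hourly_total = capex / lifetime_hours + opex_per_hour
    return hourly_total / cloud_cost_per_used_hour(price_per_hour, slowdown)
```

Plug in your own prices: if `break_even_utilization(...)` comes out above 1, the owned cluster cannot win under these assumptions even when it is always busy; if it comes out low, a mostly busy physical cluster is the cheaper option.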
Re: Why Hadoop is slow in Cloud
On 20/01/11 23:24, Marc Farnum Rendino wrote:
> On Wed, Jan 19, 2011 at 2:50 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
>> As for virtualization, paravirtualization, emulation (whatever-ulization)
>
> Wow; that's a really big category.
>
>> There are always a lot of variables, but the net result is always less. It may be 2%, 10% or 15%, but it is always less.
>
> If it's less of something I don't care about, it's not a factor (for me). On the other hand, if I'm paying less and getting more of what I DO care about, I'd rather go with that. It's about the cost/benefit *ratio*.

There's also perf vs storage. On a big cluster, you could add a second Nehalem CPU and maybe get a 10-15% boost in throughput, or for the same capex and opex add 10% new servers, which at scale means many more TB of storage and the compute to go with it. The decision rests with the team and their problems.
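The perf-vs-storage trade-off above can be sketched as a toy comparison. The function names, the linear scaling of throughput with node count, and the premise that both options cost the same are assumptions taken from the hypothetical in the message, not measurements.

```python
def upgrade_cpus(nodes, throughput_per_node, tb_per_node, boost=0.10):
    """Option A: keep the node count, raise per-node throughput by `boost`."""
    return {
        "nodes": nodes,
        "throughput": nodes * throughput_per_node * (1 + boost),
        "storage_tb": nodes * tb_per_node,
    }


def add_servers(nodes, throughput_per_node, tb_per_node, extra=0.10):
    """Option B: spend the same budget on `extra` more identical nodes.

    Assumes throughput and storage both scale linearly with node count.
    """
    new_nodes = nodes + round(nodes * extra)
    return {
        "nodes": new_nodes,
        "throughput": new_nodes * throughput_per_node,
        "storage_tb": new_nodes * tb_per_node,
    }
```

With equal percentages, the two options give the same throughput under this model, but only the second one adds storage. Which matters more depends on the workload.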
Re: Why Hadoop is slow in Cloud
On 21/01/11 09:20, Evert Lammerts wrote:
>> Even with the performance hit, there are still benefits to running Hadoop this way:
>> - As you only consume and pay for the CPU time you use, if you are only running batch jobs, it's lower cost than having a Hadoop cluster that is under-used.
>> - If your data is stored in the cloud infrastructure, then you need to data mine it in VMs, unless you want to take the time and money hit of moving it out, and have somewhere to store it.
>> - If the infrastructure lets you, you can lock down the cluster so it is secure.
>>
>> Where a physical cluster is good is that it is a very low cost way of storing data, provided you can analyse it with Hadoop, and provided you can keep that cluster busy most of the time, either with Hadoop work or other scheduled work. If your cluster is idle for computation, you are still paying the capital and (reduced) electricity costs, so the cost of storage and what compute you do effectively increases.
>
> Agreed, but this has little to do with Hadoop as middleware and more to do with the benefits of virtualized vs physical infrastructure. I agree that it is convenient to use HDFS as a DFS to keep your data local to your VMs, but you could choose other DFSs as well.

We don't use HDFS; we bring up VMs close to where the data persists.
http://www.slideshare.net/steve_l/high-availability-hadoop

> The major benefit of Hadoop is its data-locality principle, and this is what you give up when you move to the cloud. Regardless of whether you store your data within your VM or on a NAS, it *will* have to travel over a line. As soon as that happens you lose the benefit of data locality and are left with MapReduce as a way of doing parallel computing. And in that case you could use less restrictive software, like maybe PBS. You could even install HOD on your virtual cluster, if you'd like the possibility of MapReduce.

We don't suffer locality hits so much, but you do pay the extra infrastructure costs of a more agile datacentre, and if you go for redundancy in hardware over replication, you have fewer places to run your code.

Even on EC2, which doesn't let you tell it what datasets you want to play with for its VM placer to use in its decisions, once data is in the datanodes you do get locality.

> Adarsh, there are probably results around from more generic benchmark tools (Phoronix, POV-Ray, ...) for I/O and CPU performance in a VM. Those should give you a better idea of the penalties of virtualization. (Our experience with a number of technologies on our OpenNebula cloud is, like Steve points out, that you mainly pay in disk I/O performance.)

It would be interesting to see anything you can publish there...

> I think a decision to go with either cloud or physical infrastructure should be based on the frequency, intensity and types of computation you expect in the short term (that should include operations dealing with data), and the way you think these parameters will develop in the mid to long term. Then compare the price of a physical cluster that meets those demands (make sure to include power and operations) with the investment you would otherwise need to make in cloud.

+1
Re: Why Hadoop is slow in Cloud
On Wed, Jan 19, 2011 at 2:50 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
> As for virtualization, paravirtualization, emulation (whatever-ulization)

Wow; that's a really big category.

> There are always a lot of variables, but the net result is always less. It may be 2%, 10% or 15%, but it is always less.

If it's less of something I don't care about, it's not a factor (for me). On the other hand, if I'm paying less and getting more of what I DO care about, I'd rather go with that. It's about the cost/benefit *ratio*.
Re: Why Hadoop is slow in Cloud
On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote:
> I want to know *AT WHAT COST* it comes. 10-15% is tolerable, but at this rate it needs some work. As Steve rightly suggests, I am doing some CPU-bound testing to get the exact stats.

Yep; you've got to test your own workflow to see how it's affected by your conditions - lots of variables.

BTW: For AWS (Amazon) there are significant differences in I/O between instance types; if I recall correctly, for best I/O, start no lower than m1.large. And the three storage types (instance, EBS, and S3) have different characteristics as well; I'd start with EBS, though I haven't worked much with S3 yet, and that does offer some benefits.
Re: Why Hadoop is slow in Cloud
On Wed, Jan 19, 2011 at 1:32 PM, Marc Farnum Rendino mvg...@gmail.com wrote:
> On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote:
>> I want to know *AT WHAT COST* it comes. 10-15% is tolerable, but at this rate it needs some work. As Steve rightly suggests, I am doing some CPU-bound testing to get the exact stats.
>
> Yep; you've got to test your own workflow to see how it's affected by your conditions - lots of variables.
>
> BTW: For AWS (Amazon) there are significant differences in I/O between instance types; if I recall correctly, for best I/O, start no lower than m1.large. And the three storage types (instance, EBS, and S3) have different characteristics as well; I'd start with EBS, though I haven't worked much with S3 yet, and that does offer some benefits.

As for virtualization, paravirtualization, emulation (whatever-ulization).

There are always a lot of variables, but the net result is always less. It may be 2%, 10% or 15%, but it is always less. Take a $50,000 server: such a solution takes 10% of the performance right off the top. There goes $5,000 of performance right out the window.

I never thought throwing away performance was acceptable (I was born without a silver SSD in my crib). Plus some people even pay for virtualization software (vendors will remain nameless). Truly paying for less.
Re: Why Hadoop is slow in Cloud
Virtualization != Emulation

Yes, virtualization does have its own costs (as does running directly on hardware) - depending on the specifics of both the virtualization *and* the task at hand.

If my task (in the general sense) is CPU bound, it doesn't matter (to me) if the virtualization has a disk I/O penalty.

If, on the other hand, my task is limited by a disk I/O penalty, I'll weigh that into the *total* cost/benefit, and virtualization may not - or may still - be an advantageous choice.

Context is king.

On Mon, Jan 17, 2011 at 10:41 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
> Everything you emulate you cut X% performance right off the top...
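To illustrate the CPU-bound vs I/O-bound point, here is a minimal sketch of how a per-resource penalty turns into an overall slowdown. The function name, the penalty parameters and the premise that CPU and I/O phases do not overlap are assumptions made for illustration; real jobs overlap the two, so treat the result as a rough estimate.

```python
def virtualized_runtime(native_runtime, io_fraction, cpu_penalty, io_penalty):
    """Estimate the runtime of a job inside a VM.

    native_runtime: runtime on physical hardware (any time unit)
    io_fraction:    share of the native runtime spent waiting on disk I/O (0..1)
    cpu_penalty:    fractional CPU slowdown in the VM (0.05 means 5% slower)
    io_penalty:     fractional disk I/O slowdown in the VM (1.0 means 2x slower)

    Assumes CPU and I/O phases do not overlap (a simplification).
    """
    if not 0 <= io_fraction <= 1:
        raise ValueError("io_fraction must be in [0, 1]")
    cpu_time = native_runtime * (1 - io_fraction) * (1 + cpu_penalty)
    io_time = native_runtime * io_fraction * (1 + io_penalty)
    return cpu_time + io_time
```

For a purely CPU-bound job (`io_fraction=0`) the disk penalty drops out entirely, while a streaming job with a high `io_fraction` inherits almost the whole disk penalty.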
Re: Why Hadoop is slow in Cloud
Marc Farnum Rendino wrote:
> Virtualization != Emulation
>
> Yes, virtualization does have its own costs (as does running directly on hardware) - depending on the specifics of both the virtualization *and* the task at hand.

Absolutely right, and that is why I am performing the initial testing. I want to know *AT WHAT COST* it comes. 10-15% is tolerable, but at this rate it needs some work. As Steve rightly suggests, I am doing some CPU-bound testing to get the exact stats. I will let you know after the work.

> If my task (in the general sense) is CPU bound, it doesn't matter (to me) if the virtualization has a disk I/O penalty.

But is it possible to tune the workflow of the VMs to gain some performance, or not?

> If, on the other hand, my task is limited by a disk I/O penalty, I'll weigh that into the *total* cost/benefit, and virtualization may not - or may still - be an advantageous choice.

Some reasons for the slowness would be highly helpful. Any guidance is appreciated.

> Context is king.

Thanks and best regards,
Adarsh Sharma

> On Mon, Jan 17, 2011 at 10:41 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
>> Everything you emulate you cut X% performance right off the top...
Re: Why Hadoop is slow in Cloud
On 17/01/11 04:11, Adarsh Sharma wrote:
> Dear all,
>
> Yesterday I ran a comparison between *Hadoop on standalone servers* and *Hadoop in the cloud*.
>
> I set up a Hadoop cluster of 4 nodes (standalone machines) in which one node acts as master (NameNode, JobTracker) and the remaining nodes act as slaves (DataNodes, TaskTrackers). For testing Hadoop in the *cloud* (Eucalyptus), I made one standalone machine the *Hadoop master* and configured the slaves on VMs in the cloud.
>
> I am confused by the stats obtained from the testing. What I concluded is that the VMs give half the performance of the standalone servers.

Interesting stats, nothing that massively surprises me, especially as your benchmarks are very much streaming through datasets. If you were doing something more CPU intensive (graph work, for example), things wouldn't look so bad.

I've done stuff in this area.
http://www.slideshare.net/steve_l/farming-hadoop-inthecloud

> I expected some slowdown, but never at this level. Is this genuine, or might there be some configuration problem?
>
> I am using a 1 Gb (10-1000 Mb/s) LAN for the VMs and 100 Mb/s for the standalone servers.
>
> Please have a look at the results and comment if interested.

The big killer here is file IO. With today's HDD controllers and virtual filesystems, virtual disk IO is way underpowered compared to physical disk IO. Networking is reduced (but improving), and CPU can be pretty good, but disk is bad.

Why?

1. Every access to a block in the VM is turned into virtual disk controller operations, which are then interpreted by the VDC and turned into reads/writes in the virtual disk drive,
2. which are turned into seeks, reads and writes in the physical hardware.

Some workarounds:
- Allocate physical disks for the HDFS filesystem, for the duration of the VMs.
- Have the local hosts serve up a bit of their filesystem on a fast protocol (like NFS), and have every VM mount the local physical NFS filestore as their Hadoop data dirs.
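Before and after trying either workaround, it helps to measure what a data directory actually delivers. Below is a minimal sketch of a sequential write/read micro-benchmark. The function name and default sizes are hypothetical, and it is no substitute for a proper benchmark suite. Run it against the same kind of directory on a physical host and inside a VM and compare.

```python
import os
import tempfile
import time


def sequential_write_read_mb_per_s(directory, size_mb=64, block_kb=1024):
    """Write, then read back, a temporary file in `directory`.

    Returns (write_MB_per_s, read_MB_per_s). The write is fsync'ed so it
    reaches the (virtual) disk. The read may be served from the OS page
    cache, so treat the read figure as an upper bound unless the file is
    much larger than RAM.
    """
    block = os.urandom(block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        total = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(block_kb * 1024)
                if not chunk:
                    break
                total += len(chunk)
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)

    megabytes = total / (1024 * 1024)
    return megabytes / write_s, megabytes / read_s
```

For example, `sequential_write_read_mb_per_s("/path/to/hadoop/data/dir", size_mb=1024)` on each machine gives two numbers to compare; the path is a placeholder for wherever your Hadoop data dirs live.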
Re: Why Hadoop is slow in Cloud
On Mon, Jan 17, 2011 at 6:08 AM, Steve Loughran ste...@apache.org wrote:
> On 17/01/11 04:11, Adarsh Sharma wrote:
>> Dear all,
>>
>> Yesterday I ran a comparison between *Hadoop on standalone servers* and *Hadoop in the cloud*.
>>
>> I set up a Hadoop cluster of 4 nodes (standalone machines) in which one node acts as master (NameNode, JobTracker) and the remaining nodes act as slaves (DataNodes, TaskTrackers). For testing Hadoop in the *cloud* (Eucalyptus), I made one standalone machine the *Hadoop master* and configured the slaves on VMs in the cloud.
>>
>> I am confused by the stats obtained from the testing. What I concluded is that the VMs give half the performance of the standalone servers.
>
> Interesting stats, nothing that massively surprises me, especially as your benchmarks are very much streaming through datasets. If you were doing something more CPU intensive (graph work, for example), things wouldn't look so bad.
>
> I've done stuff in this area.
> http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
>
>> I expected some slowdown, but never at this level. Is this genuine, or might there be some configuration problem?
>>
>> I am using a 1 Gb (10-1000 Mb/s) LAN for the VMs and 100 Mb/s for the standalone servers.
>>
>> Please have a look at the results and comment if interested.
>
> The big killer here is file IO. With today's HDD controllers and virtual filesystems, virtual disk IO is way underpowered compared to physical disk IO. Networking is reduced (but improving), and CPU can be pretty good, but disk is bad.
>
> Why?
>
> 1. Every access to a block in the VM is turned into virtual disk controller operations, which are then interpreted by the VDC and turned into reads/writes in the virtual disk drive,
> 2. which are turned into seeks, reads and writes in the physical hardware.
>
> Some workarounds:
> - Allocate physical disks for the HDFS filesystem, for the duration of the VMs.
> - Have the local hosts serve up a bit of their filesystem on a fast protocol (like NFS), and have every VM mount the local physical NFS filestore as their Hadoop data dirs.

Q: Why is my Nintendo emulator slow on an 800 MHz computer made 10 years after the Nintendo?
A: Emulation

Everything you emulate you cut X% performance right off the top.

Emulation is great when you want to run Mac on Windows or FreeBSD on Linux or Nintendo on Linux. However, most people would do better with technologies that use kernel-level isolation, such as Linux containers, Solaris Zones, Linux-VServer (my favorite) http://linux-vserver.org/, User Mode Linux or similar technologies that ISOLATE rather than EMULATE.

Sorry list, I feel I rant about this bi-annually. I have just always been so shocked about how many people get lured into cloud and virtualized solutions for better management and near native performance.
Why Hadoop is slow in Cloud
Dear all,

Yesterday I ran a comparison between *Hadoop on standalone servers* and *Hadoop in the cloud*.

I set up a Hadoop cluster of 4 nodes (standalone machines) in which one node acts as master (NameNode, JobTracker) and the remaining nodes act as slaves (DataNodes, TaskTrackers). For testing Hadoop in the *cloud* (Eucalyptus), I made one standalone machine the *Hadoop master* and configured the slaves on VMs in the cloud.

I am confused by the stats obtained from the testing. What I concluded is that the VMs give half the performance of the standalone servers. I expected some slowdown, but never at this level. Is this genuine, or might there be some configuration problem?

I am using a 1 Gb (10-1000 Mb/s) LAN for the VMs and 100 Mb/s for the standalone servers.

Please have a look at the results and comment if interested.

Thanks and regards,
Adarsh Sharma

Attachment: hadoop_testing_new.ods (application/vnd.oasis.opendocument.spreadsheet)