Re: Hadoop cluster running in cloudstack
On Jun 6, 2013, at 4:05 AM, Shanker Balan shanker.ba...@shapeblue.com wrote:

On 05-Jun-2013, at 12:13 AM, David Ortiz dpor...@outlook.com wrote:

Hello, Has anyone tried running a hadoop cluster in a cloudstack environment? I have set one up, but I am finding that I am having some IO contention between slave nodes on each host since they all share one local storage pool. As I understand it, there is not currently a method for using multiple local storage pools with VMs through cloudstack. Has anyone found a workaround for this by any chance?

Hi David,

Have you seen Seb's http://www.slideshare.net/sebastiengoasguen/cloudstack-and-bigdata slides yet?

As a quick disclaimer, the various configurations I highlight in this deck are a bit hand-wavy and I did not test them. I just made a guess about how one might want to use the baremetal functionality in cloudstack. The main distinction is between using a big data store as the storage backend of cloudstack and using cloudstack to provision a big data store on demand.

-sebastien

In my experience running Hadoop (100+ nodes) on traditional servers, it's going to be really hard to scale up Hadoop workloads using local storage and HDFS on a cloud. I ran out of IOPS very quickly. There was enough CPU headroom, but I could not add more slots as disk became the bottleneck. Every time there was a node/disk failure, rebalancing was a nightmare with a 3x HDFS replication factor.

If I were to run Hadoop on an IaaS cloud, I would do it much like Amazon AWS EMR: instances backed by a Storage as a Service layer (S3) for big data instead of HDFS.

The system would work as below:

- Create a dedicated big data storage tier using a distributed filesystem like Gluster/Ceph/Isilon. Most of the vendors now provide S3 compat connectors for Hadoop.
  http://ceph.com/docs/master/cephfs/hadoop/
  http://gluster.org/community/documentation/index.php/Hadoop
  http://www.emc.com/big-data/scale-out-storage-hadoop.htm
- Hadoop instances are spun up on bare metal or on hypervisors. The service offerings for big data instances would run on dedicated hypervisors (via tags) with high-bandwidth network connectivity to the storage service.
- Hadoop instances use local storage for run time data.
- Hadoop VMs connect to the storage tier via connectors for permanent storage.

Benefits:

- Spinning up/down VMs doesn't cause HDFS rebalancing as there is no HDFS anywhere.
- Scale out VMs independently of storage. Add more spindles / nodes to the storage cluster to scale out IOPS and capacity.
- Easy upgrade of Hadoop releases without risk to data.

Regards.
@shankerbalan

--
Shanker Balan
Managing Consultant
M: +91 98860 60539
shanker.ba...@shapeblue.com | www.shapeblue.com | Twitter:@shapeblue
ShapeBlue India, 22nd floor, Unit 2201A, World Trade Centre, Bangalore - 560 055

This email and any attachments to it may be confidential and are intended solely for the use of the individual to whom it is addressed. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Shape Blue Ltd or related companies. If you are not the intended recipient of this email, you must neither take any action based upon its contents, nor copy or show it to anyone. Please contact the sender if you believe you have received this email in error. Shape Blue Ltd is a company incorporated in England & Wales. ShapeBlue Services India LLP is operated under license from Shape Blue Ltd. ShapeBlue is a registered trademark.
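For anyone wanting to picture the "storage tier via connectors" idea in the message above, here is a minimal sketch in Python that renders a Hadoop core-site.xml pointing at an S3-compatible endpoint. It is an illustration under assumptions, not a tested setup: the property names come from Hadoop's S3A connector, which is newer than this thread and is only one of several possible connectors; the endpoint URL and bucket name are hypothetical placeholders; and whether the object store should be the default filesystem at all is a judgment call.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values: the bucket and endpoint below are placeholders,
# not anything named in the thread.
core_site = {
    # Use the object store instead of HDFS as the default filesystem.
    "fs.defaultFS": "s3a://bigdata-bucket",
    # Address of the S3-compatible storage service (assumed).
    "fs.s3a.endpoint": "http://storage.example.internal:8080",
    # Non-AWS S3 gateways commonly need path-style bucket addressing.
    "fs.s3a.path.style.access": "true",
}

core_site_xml = render_site_xml(core_site)
print(core_site_xml)
```

Credentials are deliberately left out of the sketch. Run-time scratch data would still go to each instance's local disk through the usual local-directory settings, which matches the "local storage for run time data" point in the proposal.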
RE: Hadoop cluster running in cloudstack
Chiradeep,

Currently I am working with KVM hypervisor nodes. The use case of having 4 spindles and assigning one to each node is exactly what I would like to do. For the moment I have all four spindles configured in a RAID with the cloudstack local storage pointed at it.

Shanker,

I had not seen that slideshow yet, so thank you for pointing me to it. As of now, the hadoop resources I am using are statically allocated between 4 hosts. As it stands now, I am constrained to those resources without the ability to add any additional storage cluster (or additional storage to my current shared storage appliance), or additional nodes. Fortunately, my use cases don't require any kind of reallocation of the hadoop nodes. It's more clients for the cluster as well as web service nodes that run clients that are being dynamically spun up and down. I have found that I can get through my jobs alright, they just take a lot of extra time to run since I have the storage acting as a bottleneck right now.

Thanks,
David Ortiz

From: run...@gmail.com
Subject: Re: Hadoop cluster running in cloudstack
Date: Thu, 6 Jun 2013 10:23:50 -0400
To: users@cloudstack.apache.org
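To put a number on the storage bottleneck described above, a rough sketch like the following could be run inside the slave VMs. It only measures sequential write throughput to a directory; the function name, the sizes, and the use of the system temp directory are all assumptions for illustration, and it is not a substitute for a proper benchmark tool.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, total_mb=64, chunk_mb=4):
    """Write about total_mb of data to a temp file in directory, fsync it,
    and return the observed throughput in MB/s."""
    # Random bytes, so a compressing filesystem cannot shortcut the write.
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    chunks = total_mb // chunk_mb
    written_mb = chunks * chunk_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(chunks):
                f.write(chunk)
            f.flush()
            # Force the data to the device so the page cache is not measured.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return written_mb / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Assumed target: the system temp directory; point it at the data disk instead.
    print(f"{write_throughput_mb_s(tempfile.gettempdir()):.1f} MB/s")
```

Running it in one VM alone and then in all VMs on a host at the same time gives a crude view of how much the shared pool costs each node.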
Hadoop cluster running in cloudstack
Hello,

Has anyone tried running a hadoop cluster in a cloudstack environment? I have set one up, but I am finding that I am having some IO contention between slave nodes on each host since they all share one local storage pool. As I understand it, there is not currently a method for using multiple local storage pools with VMs through cloudstack. Has anyone found a workaround for this by any chance?

Thanks,
David Ortiz
Re: Hadoop cluster running in cloudstack
This is a very interesting question - I know I've asked for better control over where VMs end up going, specifically to be able to ensure rack locality for Hadoop nodes, but I don't know what the progress has been on that, nor do I know whether there's a way of doing multiple storage pools on a single host short of some really silly jumping through hoops, like running multiple cloud agents on a single host.

But I could be wrong - I'm not as in touch with the internals as others may be.

A.

On Tue, Jun 4, 2013 at 11:43 AM, David Ortiz dpor...@outlook.com wrote:

Hello, Has anyone tried running a hadoop cluster in a cloudstack environment? I have set one up, but I am finding that I am having some IO contention between slave nodes on each host since they all share one local storage pool. As I understand it, there is not currently a method for using multiple local storage pools with VMs through cloudstack. Has anyone found a workaround for this by any chance?

Thanks,
David Ortiz
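On the rack locality point, this does not give any control over where CloudStack places VMs, but Hadoop can at least be told where they ended up. Hadoop's rack awareness can call an external script, configured through the topology script property, passing node addresses as arguments and reading one rack path per address from standard output. A minimal sketch of such a script follows; the address-to-rack table is entirely hypothetical and would have to be maintained by hand, or generated from whatever knows the VM-to-host placement.

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping from slave VM address to the rack of the
# hypervisor host it runs on. These addresses are made up.
RACK_BY_ADDRESS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
}

# Rack reported for any address that is not in the table.
DEFAULT_RACK = "/default-rack"


def resolve_racks(addresses):
    """Return one rack path per address, in the same order as the input."""
    return [RACK_BY_ADDRESS.get(address, DEFAULT_RACK) for address in addresses]


if __name__ == "__main__":
    # Hadoop passes one or more addresses as command-line arguments.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

If VMs migrate between hosts, the table goes stale, which is part of why better placement control on the CloudStack side would still be the cleaner answer.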