Re: Hadoop cluster running in cloudstack

Sebastien Goasguen Thu, 06 Jun 2013 07:24:47 -0700

On Jun 6, 2013, at 4:05 AM, Shanker Balan <[email protected]> wrote:


> On 05-Jun-2013, at 12:13 AM, David Ortiz <[email protected]> wrote:
> 
>> Hello,
>>    Has anyone tried running a hadoop cluster in a cloudstack environment?  I 
>> have set one up, but I am finding that I am having some IO contention 
>> between slave nodes on each host since they all share one local storage 
>> pool.  As I understand it, there is not currently a method for using 
>> multiple local storage pools with VMs through cloudstack.  Has anyone found 
>> a workaround for this by any chance?
> 
> 
> Hi David,
> 
> Have you seen Seb's 
> http://www.slideshare.net/sebastiengoasguen/cloudstack-and-bigdata slides yet?

As a quick disclaimer, the various configurations I highlight in this deck are 
a bit hand-wavy and I did not test them. I just made a guess about how one 
might want to use the baremetal functionality in cloudstack. The main 
distinction is between using a "big data" store as a storage backend for 
cloudstack and using cloudstack to provision a big data store on demand.

-sebastien

> 
> In my experience running Hadoop (100+ nodes) on traditional servers, it's 
> going to be really hard to scale up Hadoop workloads using local storage and 
> HDFS on a cloud.
> 
> I ran out of IOPS very quickly. There was enough CPU headroom but could not 
> add more slots as disk became the bottleneck. Every time there was a 
> node/disk failure, rebalancing was a nightmare with a 3x HDFS replication 
> factor. 
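
To put a rough number on the rebalancing cost mentioned above, here is a
back-of-the-envelope sketch. It assumes the replicas held by the failed node
are re-copied in parallel by the surviving nodes, each with some spare
throughput. The function name and all figures are hypothetical and only
illustrate the arithmetic; they are not measurements from any cluster.

```python
def rereplication_hours(node_data_tb, surviving_nodes, per_node_mb_s):
    """Rough time to restore full replication after one node is lost.

    Assumption: the lost node's block replicas are re-copied in parallel
    by the surviving nodes, each contributing per_node_mb_s of spare
    disk/network throughput.
    """
    data_mb = node_data_tb * 1024 * 1024  # TB -> MB (binary units)
    aggregate_mb_s = surviving_nodes * per_node_mb_s
    return data_mb / aggregate_mb_s / 3600.0  # seconds -> hours


# Hypothetical figures: 12 TB of replicas on the failed node, 99 survivors,
# 10 MB/s of spare throughput per surviving node.
hours = rereplication_hours(12, 99, 10)
print(f"approx. {hours:.1f} hours to re-replicate")
```

The point of the sketch is only that the recovery traffic competes with job
IO on the same disks, so the less spare throughput each node has, the longer
the cluster stays degraded.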
> 
> If I were to run Hadoop on an IaaS cloud, I would do it much the way 
> Amazon AWS EMR does - instances backed by a "Storage As A Service" layer 
> (S3) for big data instead of HDFS.
> 
> The system would work as below:
> 
> - Create a dedicated big data storage tier using a distributed filesystem 
> like Gluster/Ceph/Isilon. Most of the vendors now provide S3-compatible 
> connectors for Hadoop.
> 
> http://ceph.com/docs/master/cephfs/hadoop/
> http://gluster.org/community/documentation/index.php/Hadoop
> http://www.emc.com/big-data/scale-out-storage-hadoop.htm
> 
> - Hadoop instances are spun up on bare metal or on hypervisors. The service 
> offerings for "big data" instances could run on dedicated hypervisors 
> (via tags) with high bandwidth network connectivity to the storage service.
> 
> - Hadoop instances use local storage for runtime data.
> 
> - Hadoop VMs connect to the storage tier via connectors for permanent storage
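
As a sketch of what the connector step in the list above could look like in
configuration terms, the snippet below generates a Hadoop-style core-site.xml
fragment. It assumes the S3A connector from the hadoop-aws module (a later
Hadoop connector, not necessarily what the vendor links above ship), and the
endpoint and credentials are placeholders. The helper name build_core_site is
made up for this example.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Return a Hadoop-style <configuration> XML string from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values: the endpoint and credentials are hypothetical.
# The property names are the S3A connector's (assumption: hadoop-aws is used).
props = {
    "fs.s3a.endpoint": "http://storage.example.internal:8080",
    "fs.s3a.access.key": "EXAMPLE_ACCESS_KEY",
    "fs.s3a.secret.key": "EXAMPLE_SECRET_KEY",
    "fs.s3a.path.style.access": "true",
}
core_site_xml = build_core_site(props)
print(core_site_xml)
```

With something like this in place, jobs would read and write s3a:// URIs on
the storage tier, while local disk is kept for scratch data only.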
> 
> Benefits:
> 
> - Spinning VMs up or down doesn't cause HDFS rebalancing, as there is no 
> HDFS anywhere.
> 
> - Scale out VMs independently of storage. Add more spindles / nodes to the 
> storage cluster to scale out IOPS and capacity
> 
> - Easy upgrade of Hadoop releases without risk to data
> 
> Regards.
> @shankerbalan
> 
> -- 
> Shanker Balan
> Managing Consultant
> 
> 
> 
> M: +91 98860 60539
> [email protected] | www.shapeblue.com | Twitter:@shapeblue
> ShapeBlue India, 22nd floor, Unit 2201A, World Trade Centre, Bangalore - 560 
> 055
> 
