Re: Hadoop cluster running in cloudstack

Shanker Balan Thu, 06 Jun 2013 01:06:55 -0700

On 05-Jun-2013, at 12:13 AM, David Ortiz wrote:


Hello,
   Has anyone tried running a Hadoop cluster in a CloudStack environment?  I 
have set one up, but I am seeing I/O contention between slave nodes on each 
host because they all share one local storage pool.  As I understand it, 
CloudStack does not currently offer a way to use multiple local storage pools 
with VMs.  Has anyone found a workaround for this?


Hi David,

Have you seen Seb's 
http://www.slideshare.net/sebastiengoasguen/cloudstack-and-bigdata slides yet?

In my experience running Hadoop (100+ nodes) on traditional servers, it's going 
to be really hard to scale up Hadoop workloads using local storage and HDFS on 
a cloud.

I ran out of IOPS very quickly. There was enough CPU headroom, but I could not 
add more task slots because disk became the bottleneck. Every time a node or 
disk failed, rebalancing was a nightmare with a 3x HDFS replication factor.
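
To put a rough number on the rebalancing cost: when a DataNode is lost, HDFS 
has to re-create every block replica that lived on it. The sketch below is a 
back-of-the-envelope estimate only. The function name, the even spread of data 
and copy work across nodes, and all input figures are assumptions for 
illustration, not measurements from the cluster described above.

```python
def rereplication_seconds(logical_tb, replication, nodes, per_node_mb_s):
    """Estimate the time to restore full replication after losing one node.

    Assumes data is spread evenly over all nodes and that the surviving
    nodes share the copy work equally (both are simplifications).
    """
    # Raw bytes stored on each node = logical data * replication / node count
    raw_bytes_per_node = logical_tb * 1e12 * replication / nodes
    # Aggregate bandwidth the surviving nodes can spend on re-replication
    aggregate_bytes_per_s = (nodes - 1) * per_node_mb_s * 1e6
    return raw_bytes_per_node / aggregate_bytes_per_s


if __name__ == "__main__":
    # Hypothetical cluster: 300 TB of logical data, 3x replication, 100 nodes,
    # each able to spare 20 MB/s of disk/network bandwidth for recovery.
    secs = rereplication_seconds(300, 3, 100, 20)
    print(f"about {secs / 3600:.1f} hours to re-replicate")
```

The bandwidth figure is the important knob: on a cluster that is already out 
of IOPS, whatever goes to recovery comes straight out of what the jobs can use.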

If I were to run Hadoop on an IaaS cloud, I would do it much like Amazon AWS 
EMR: instances backed by a "Storage as a Service" layer (S3) for big data 
instead of HDFS.

The system would work as follows:

- Create a dedicated big data storage tier using a distributed filesystem like 
Gluster/Ceph/Isilon. Most of the vendors now provide S3-compatible connectors 
for Hadoop.

http://ceph.com/docs/master/cephfs/hadoop/
http://gluster.org/community/documentation/index.php/Hadoop
http://www.emc.com/big-data/scale-out-storage-hadoop.htm

- Hadoop instances are spun up on bare metal or on hypervisors. The service 
offerings for "big data" instances could run on dedicated hypervisors (via 
tags) with high-bandwidth network connectivity to the storage service.

- Hadoop instances use local storage for runtime data.

- Hadoop VMs connect to the storage tier via connectors for permanent storage.
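
As a rough illustration of the last two points, the sketch below generates a 
core-site.xml that points Hadoop at an S3-compatible endpoint for permanent 
data and at a local path for runtime data. It is an assumption-laden example, 
not a tested deployment: it uses the S3A connector property names found in 
later Hadoop releases, and the endpoint URL, bucket name, credentials and 
local path are all hypothetical placeholders. The Ceph, Gluster and Isilon 
connectors linked above each have their own settings.

```python
import xml.etree.ElementTree as ET


def build_core_site(endpoint, bucket, access_key, secret_key, local_tmp):
    """Return core-site.xml text for an S3-compatible storage tier."""
    props = {
        # Permanent storage lives in the object store, not in HDFS
        "fs.defaultFS": f"s3a://{bucket}",
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Non-AWS stores commonly need path-style bucket addressing
        "fs.s3a.path.style.access": "true",
        # Runtime/scratch data stays on the instance's local disk
        "hadoop.tmp.dir": local_tmp,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or credentials.
    print(build_core_site(
        "http://storage.example.internal:7480",
        "bigdata",
        "ACCESS_KEY",
        "SECRET_KEY",
        "/mnt/local/hadoop-tmp",
    ))
```

With a layout like this, a compute VM holds nothing that needs to survive it, 
which is what makes spinning instances up and down cheap.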

Benefits:

- Spinning VMs up or down doesn't cause HDFS rebalancing, as there is no HDFS 
anywhere.

- Scale out VMs independently of storage. Add more spindles / nodes to the 
storage cluster to scale out IOPS and capacity.

- Easy upgrade of Hadoop releases without risk to data

Regards.
@shankerbalan

--
Shanker Balan
Managing Consultant

