Re: Hadoop cluster running in cloudstack

2013-06-06 Thread Sebastien Goasguen

On Jun 6, 2013, at 4:05 AM, Shanker Balan shanker.ba...@shapeblue.com wrote:

 On 05-Jun-2013, at 12:13 AM, David Ortiz dpor...@outlook.com wrote:
 
 Hello,
Has anyone tried running a hadoop cluster in a cloudstack environment?  I 
 have set one up, but I am finding that I am having some IO contention 
 between slave nodes on each host since they all share one local storage 
 pool.  As I understand it, there is not currently a method for using 
 multiple local storage pools with VMs through cloudstack.  Has anyone found 
 a workaround for this by any chance?
 
 
 Hi David,
 
 Have you seen Seb's 
 http://www.slideshare.net/sebastiengoasguen/cloudstack-and-bigdata slides yet?

As a quick disclaimer, the various configurations I highlight in this deck are 
a bit hand-wavy, and I did not test them. I just made a guess about how one 
might want to use the baremetal functionality in cloudstack. The main 
distinction is between using a big data store as a storage backend for 
cloudstack and using cloudstack to provision a big data store on demand.

-sebastien

 
 In my experience running Hadoop (100+ nodes) on traditional servers, it's 
 going to be really hard to scale up Hadoop workloads using local storage and 
 HDFS on a cloud.
 
 I ran out of IOPS very quickly. There was enough CPU headroom, but I could not 
 add more slots because disk became the bottleneck. Every time there was a 
 node/disk failure, rebalancing was a nightmare with a 3x HDFS replication 
 factor. 
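
To put rough numbers on the failure case above: when a node dies, HDFS has to re-create every block replica that lived on it from the surviving copies. Here is a back-of-the-envelope sketch in Python. The even-spread model and all figures (12 TB per node, 10 MB/s of spare bandwidth per node) are assumptions for illustration, not measurements from any cluster in this thread.

```python
def raw_capacity_tb(logical_tb, replication=3):
    """Raw disk needed to hold logical_tb of data at a given replication factor."""
    return logical_tb * replication


def rereplication_hours(node_data_tb, surviving_nodes, spare_mb_s_per_node):
    """Rough time to restore full replication after losing one node.

    Assumes the lost node's blocks are re-copied evenly by the surviving
    nodes, each contributing spare_mb_s_per_node of disk/network bandwidth
    that is not already consumed by running jobs.
    """
    if surviving_nodes <= 0 or spare_mb_s_per_node <= 0:
        raise ValueError("need at least one surviving node with spare bandwidth")
    total_mb = node_data_tb * 1_000_000  # decimal TB -> MB
    aggregate_mb_s = surviving_nodes * spare_mb_s_per_node
    return total_mb / aggregate_mb_s / 3600


# Hypothetical cluster: 100 nodes, 12 TB of block data per node.
hours = rereplication_hours(node_data_tb=12, surviving_nodes=99, spare_mb_s_per_node=10)
```

With these made-up inputs the copy takes a bit over three hours, and that traffic competes with job I/O on disks that are already the bottleneck. Storing 100 TB of logical data at 3x replication also needs 300 TB of raw disk.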
 
 If I were to run Hadoop on an IaaS cloud, I would do it much like 
 Amazon EMR: instances backed by a storage-as-a-service layer (S3) for 
 big data instead of HDFS.
 
 The system would work as below:
 
 - Create a dedicated big data storage tier using a distributed filesystem 
 like Gluster/Ceph/Isilon. Most of the vendors now provide S3-compatible 
 connectors for Hadoop.
 
 http://ceph.com/docs/master/cephfs/hadoop/
 http://gluster.org/community/documentation/index.php/Hadoop
 http://www.emc.com/big-data/scale-out-storage-hadoop.htm
 
 - Hadoop instances are spun up on bare metal or on hypervisors. The service 
 offerings for big data instances could run on dedicated hypervisors 
 (via tags) with high-bandwidth network connectivity to the storage service.
 
 - Hadoop instances use local storage for runtime data.
 
 - Hadoop VMs connect to the storage tier via connectors for permanent storage
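
One way to express the "dedicated hypervisors (via tags)" idea is a service offering that carries a host tag and asks for local storage. The Python sketch below only builds a signed createServiceOffering URL and sends nothing. The endpoint, keys, tag name and sizing are placeholders made up for illustration. The signing steps follow the scheme in the CloudStack API documentation as I understand it, so check them against your release before relying on them.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

# Hypothetical offering: the tag "bigdata" must match a tag set on the hosts.
OFFERING = {
    "command": "createServiceOffering",
    "name": "hadoop-worker",
    "displaytext": "Hadoop worker on tagged hosts",
    "cpunumber": 4,
    "cpuspeed": 2000,      # MHz
    "memory": 16384,       # MB
    "storagetype": "local",
    "hosttags": "bigdata",
}


def _encode(params):
    """Sort by lowercased key and URL-encode each value (spaces become %20)."""
    pairs = sorted(params.items(), key=lambda kv: kv[0].lower())
    return "&".join(f"{k}={quote(str(v), safe='')}" for k, v in pairs)


def sign_request(params, secret_key):
    """HMAC-SHA1 over the lowercased, sorted query string, base64-encoded."""
    canonical = _encode(params).lower()
    digest = hmac.new(secret_key.encode(), canonical.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def build_signed_url(endpoint, params, api_key, secret_key):
    """Return the full request URL, including apiKey and signature."""
    full = dict(params, apiKey=api_key, response="json")
    signature = sign_request(full, secret_key)
    return f"{endpoint}?{_encode(full)}&signature={quote(signature, safe='')}"


# Placeholder endpoint and credentials, not real values.
url = build_signed_url(
    "http://cloudstack.example:8080/client/api", OFFERING, "demo-api-key", "demo-secret"
)
```

VMs deployed from such an offering should land only on hosts carrying the matching tag, which keeps the big data instances on the hypervisors that have the fast path to the storage service.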
 
 Benefits:
 
 - Spinning VMs up or down doesn't cause HDFS rebalancing, as there is no HDFS 
 anywhere.
 
 - Scale out VMs independently of storage. Add more spindles / nodes to the 
 storage cluster to scale out IOPS and capacity
 
 - Easy upgrade of Hadoop releases without risk to data
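
The "no HDFS anywhere" point comes down to where Hadoop's default filesystem points. As a hedged illustration, the Python sketch below writes a core-site.xml that uses an S3-compatible endpoint as the default filesystem and keeps scratch data on local disk. It assumes the S3A connector that ships with later Hadoop releases, not whatever connector each vendor linked above provides. The bucket, endpoint, credentials and local path are placeholders.

```python
import xml.etree.ElementTree as ET


def core_site_xml(bucket, endpoint, access_key, secret_key,
                  local_tmp="/mnt/local/hadoop-tmp"):
    """Build a core-site.xml string for an S3-compatible storage tier.

    Property names are standard Hadoop/S3A settings; every value passed in
    here is a placeholder chosen by the caller.
    """
    props = {
        "fs.defaultFS": f"s3a://{bucket}",      # permanent data lives in the storage tier
        "fs.s3a.endpoint": endpoint,            # the S3-compatible service, not AWS
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.path.style.access": "true",     # many non-AWS stores need path-style URLs
        "hadoop.tmp.dir": local_tmp,            # runtime/scratch data stays on local disk
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values for illustration only.
xml_text = core_site_xml("bigdata", "http://storage.example:7480", "AK", "SK")
```

Because the instances hold nothing permanent, the same file can be baked into a VM template and the workers can be created or destroyed without touching the data.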
 
 Regards.
 @shankerbalan
 
 -- 
 Shanker Balan
 Managing Consultant
 
 
 



RE: Hadoop cluster running in cloudstack

2013-06-06 Thread David Ortiz
Chiradeep,

Currently I am working with KVM hypervisor nodes. The use case of having 
4 spindles and assigning one to each node is exactly what I would like to do. 
For the moment I have all four spindles configured in a RAID with the 
cloudstack local storage pointed at it.

Shanker,

I had not seen that slideshow yet, so thank you for pointing me to it. 
As of now, the hadoop resources I am using are statically allocated between 4 
hosts. I am constrained to those resources, without the ability to add a 
separate storage cluster (or more storage to my current shared storage 
appliance) or additional nodes. Fortunately, my use cases don't require any 
kind of reallocation of the hadoop nodes. It is mostly clients for the cluster, 
and web service nodes that run clients, that are being dynamically spun up and 
down. I have found that I can get through my jobs all right; they just take a 
lot of extra time to run because the storage is acting as a bottleneck right now.

Thanks,
David Ortiz


Hadoop cluster running in cloudstack

2013-06-04 Thread David Ortiz
Hello,
Has anyone tried running a hadoop cluster in a cloudstack environment?  I 
have set one up, but I am finding that I am having some IO contention between 
slave nodes on each host since they all share one local storage pool.  As I 
understand it, there is not currently a method for using multiple local storage 
pools with VMs through cloudstack.  Has anyone found a workaround for this by 
any chance?
Thanks, David Ortiz   

Re: Hadoop cluster running in cloudstack

2013-06-04 Thread Andrew Bayer
This is a very interesting question - I know I've asked for better control
over where VMs end up going, specifically to be able to ensure rack
locality for Hadoop nodes, but I don't know what the progress has been on
that, nor do I know whether there's a way of doing multiple storage pools
on a single host short of some really silly jumping through hoops, like
running multiple cloud agents on a single host. But I could be wrong - I'm
not as in touch with the internals as others may be.

A.
