On Jun 6, 2013, at 4:05 AM, Shanker Balan <shanker.ba...@shapeblue.com> wrote:
> On 05-Jun-2013, at 12:13 AM, David Ortiz <dpor...@outlook.com> wrote:
>
>> Hello,
>> Has anyone tried running a Hadoop cluster in a CloudStack environment? I
>> have set one up, but I am finding that I am having some IO contention
>> between slave nodes on each host since they all share one local storage
>> pool. As I understand it, there is not currently a method for using
>> multiple local storage pools with VMs through CloudStack. Has anyone found
>> a workaround for this by any chance?
>
>
> Hi David,
>
> Have you seen Seb's
> http://www.slideshare.net/sebastiengoasguen/cloudstack-and-bigdata slides yet?

As a quick disclaimer, the various configurations I highlight in this deck
are a bit hand-wavy and I did not test them. I just made a guess about how
one might want to use the bare-metal functionality in CloudStack. The main
distinction is between using a "big data" store as a storage backend for
CloudStack and using CloudStack to provision a big data store on demand.

-sebastien

>
> In my experience running Hadoop (100+ nodes) on traditional servers, it's
> going to be really hard to scale up Hadoop workloads using local storage and
> HDFS on a cloud.
>
> I ran out of IOPS very quickly. There was enough CPU headroom, but I could not
> add more slots as disk became the bottleneck. Every time there was a
> node/disk failure, rebalancing was a nightmare with a 3x HDFS replication
> factor.
>
> If I were to run Hadoop on an IaaS cloud, I would do it very similarly to
> Amazon AWS EMR: instances backed by a "Storage As A Service" layer (S3) for
> big data instead of HDFS.
>
> The system would work as below:
>
> - Create a dedicated big data storage tier using a distributed filesystem
> like Gluster/Ceph/Isilon. Most of the vendors now provide S3-compatible
> connectors for Hadoop.
>
> http://ceph.com/docs/master/cephfs/hadoop/
> http://gluster.org/community/documentation/index.php/Hadoop
> http://www.emc.com/big-data/scale-out-storage-hadoop.htm
>
> - Hadoop instances are spun up on bare metal or on hypervisors. The service
> offerings for "big data" instances would run on dedicated hypervisors
> (via tags) with high-bandwidth network connectivity to the storage service.
>
> - Hadoop instances use local storage for run-time data.
>
> - Hadoop VMs connect to the storage tier via connectors for permanent storage.
>
> Benefits:
>
> - Spinning VMs up/down doesn't cause HDFS rebalancing, as there is no HDFS
> anywhere.
>
> - Scale out VMs independently of storage. Add more spindles/nodes to the
> storage cluster to scale out IOPS and capacity.
>
> - Easy upgrade of Hadoop releases without risk to data.
>
> Regards.
> @shankerbalan
>
> --
> Shanker Balan
> Managing Consultant
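
To make the "Hadoop VMs connect to the storage tier via connectors" step a
bit more concrete, here is a minimal, untested sketch of generating a
core-site.xml that points Hadoop at an S3-compatible endpoint. It assumes
the s3a connector that ships with later Hadoop releases (it postdates this
thread); the fs.s3a.* property names are the connector's own, while the
endpoint URL, the credentials, and the build_core_site helper are
hypothetical placeholders. The vendor connectors linked above use their own
settings, so treat this only as an illustration of the idea.

```python
import xml.etree.ElementTree as ET


def build_core_site(endpoint, access_key, secret_key, path_style=True):
    """Return core-site.xml text that points Hadoop's s3a connector at an
    S3-compatible storage tier. Hypothetical helper, for illustration only."""
    props = {
        # Address of the storage service (placeholder value supplied by caller).
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Many non-AWS S3 implementations expect path-style bucket URLs.
        "fs.s3a.path.style.access": "true" if path_style else "false",
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder endpoint and credentials; replace with real values.
    print(build_core_site("http://storage.example.internal:8080",
                          "ACCESS_KEY", "SECRET_KEY"))
```

With something like this baked into the instance template, jobs would read
and write s3a://<bucket>/<path> URIs instead of hdfs:// ones, so VMs can be
created and destroyed without touching the permanent data.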