Re: Hadoop cluster running in cloudstack

Chiradeep Vittal Tue, 11 Jun 2013 13:07:58 -0700

Taking it to dev@ to see if there is any interest.


It is a good and interesting requirement. I can see hacking 'pre-setup'
storage with tags to achieve this, but it is going to be a fragile hack.
I believe GCE also has the concept of some instance types having dedicated
spindles.


On 6/6/13 11:14 AM, "David Ortiz" <dpor...@outlook.com> wrote:

>Chiradeep,
>     Currently I am working with KVM hypervisor nodes.  The use case of
>having 4 spindles and assigning one to each node is exactly what I would
>like to do.  For the moment I have all four spindles configured in a RAID
>with the cloudstack local storage pointed at it.
>
>Shanker,
>      I had not seen that slideshow yet, so thank you for pointing me to
>it.  As of now, the hadoop resources I am using are statically allocated
>between 4 hosts.  As it stands now, I am constrained to those resources
>without the ability to add any additional storage cluster (or additional
>storage to my current shared storage appliance), or additional nodes.
>Fortunately, my use cases don't require any kind of reallocation of the
>hadoop nodes.  It is mostly the clients for the cluster, and the web
>service nodes that run clients, that are dynamically spun up and down.  I
>have found that I can get through my jobs all right; they just take a lot
>of extra time to run because the storage is acting as a bottleneck
>right now.
>Thanks,
>David Ortiz
>
>> From: run...@gmail.com
>> Subject: Re: Hadoop cluster running in cloudstack
>> Date: Thu, 6 Jun 2013 10:23:50 -0400
>> To: us...@cloudstack.apache.org
>> 
>> 
>> On Jun 6, 2013, at 4:05 AM, Shanker Balan <shanker.ba...@shapeblue.com>
>>wrote:
>> 
>> > On 05-Jun-2013, at 12:13 AM, David Ortiz <dpor...@outlook.com> wrote:
>> > 
>> >> Hello,
>> >>    Has anyone tried running a hadoop cluster in a cloudstack
>>environment?  I have set one up, but I am finding that I am having some
>>IO contention between slave nodes on each host since they all share one
>>local storage pool.  As I understand it, there is not currently a method
>>for using multiple local storage pools with VMs through cloudstack.  Has
>>anyone found a workaround for this by any chance?
>> > 
>> > 
>> > Hi David,
>> > 
>> > Have you seen Seb's
>>http://www.slideshare.net/sebastiengoasguen/cloudstack-and-bigdata
>>slides yet?
>> 
>> As a quick disclaimer, the various configurations I highlight in this
>>deck are a bit hand-wavy and I did not test them. I just made a guess
>>about how one might want to use the baremetal functionality in
>>cloudstack. The main distinction is between using a "big data" store
>>as a storage backend for cloudstack and using cloudstack to provision a
>>bigdata store on demand.
>> 
>> -sebastien
>> 
>> > 
>> > In my experience running Hadoop (100+ nodes) on traditional servers,
>>it's going to be really hard to scale up Hadoop workloads using local
>>storage and HDFS on a cloud.
>> > 
>> > I ran out of IOPS very quickly. There was enough CPU headroom, but I
>>could not add more slots because disk became the bottleneck. Every time
>>there was a node/disk failure, rebalancing was a nightmare with a 3x HDFS
>>replication factor.
>> > 
>> > If I were to run Hadoop on an IaaS cloud, I would do it much the way
>>Amazon AWS EMR does: instances backed by a "Storage As A Service" layer
>>(S3) for big data instead of HDFS.
>> > 
>> > The system would work as below:
>> > 
>> > - Create a dedicated big data storage tier using a distributed
>>filesystem like Gluster/Ceph/Isilon. Most of the vendors now provide S3
>>compat connectors for Hadoop.
>> > 
>> > http://ceph.com/docs/master/cephfs/hadoop/
>> > http://gluster.org/community/documentation/index.php/Hadoop
>> > http://www.emc.com/big-data/scale-out-storage-hadoop.htm
>> > 
>> > - Hadoop instances are spun up on bare metal or on hypervisors. The
>>service offerings for "big data" instances would run on dedicated
>>hypervisors (via tags) with high bandwidth network connectivity to the
>>storage service.
>> > 
>> > - Hadoop instances use local storage for runtime data.
>> > 
>> > - Hadoop VMs connect to the storage tier via connectors for permanent
>>storage
>> > 
>> > Benefits:
>> > 
>> > - Spinning VMs up or down doesn't cause HDFS rebalancing, as there is
>>no HDFS anywhere.
>> > 
>> > - Scale out VMs independently of storage. Add more spindles / nodes
>>to the storage cluster to scale out IOPS and capacity
>> > 
>> > - Easy upgrade of Hadoop releases without risk to data
>> > 
>> > Regards.
>> > @shankerbalan
>> > 
>> > -- 
>> > Shanker Balan
>> > Managing Consultant
>> > 
>> > 
>> > 
>> > M: +91 98860 60539
>> > shanker.ba...@shapeblue.com | www.shapeblue.com | Twitter:@shapeblue
>> > ShapeBlue India, 22nd floor, Unit 2201A, World Trade Centre,
>>Bangalore - 560 055
>> > 
>> 
>
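
The storage-tier layout described in the quoted message (Hadoop instances that keep runtime data on local disk and reach permanent storage through an S3-compatible connector instead of HDFS) could be sketched as a small script that renders the relevant core-site.xml properties. This is only a sketch under stated assumptions: it uses the S3A connector property names from more recent Apache Hadoop releases, which postdate this thread, and the bucket name, endpoint URL, credentials, and local directory are hypothetical placeholders rather than anything from the setups discussed here.

```python
import xml.etree.ElementTree as ET


def render_core_site(bucket, endpoint, access_key, secret_key, local_tmp_dir):
    """Return core-site.xml text that points Hadoop at an S3-compatible store.

    All argument values are placeholders supplied by the caller; nothing
    here is taken from a real deployment.
    """
    props = {
        # Use the object store, not HDFS, as the default filesystem.
        "fs.defaultFS": f"s3a://{bucket}",
        # Endpoint of the S3-compatible storage tier (hypothetical).
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Non-AWS S3 implementations often need path-style requests.
        "fs.s3a.path.style.access": "true",
        # Runtime/scratch data stays on the instance's local storage.
        "hadoop.tmp.dir": local_tmp_dir,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_core_site(
        bucket="bigdata",                                 # hypothetical
        endpoint="http://storage.example.internal:8080",  # hypothetical
        access_key="ACCESS_KEY",
        secret_key="SECRET_KEY",
        local_tmp_dir="/mnt/local/hadoop-tmp",            # hypothetical
    ))
```

Making the object store the default filesystem mirrors the "no HDFS anywhere" idea in the quoted message. A deployment could instead leave the default alone and pass s3a:// paths per job.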

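To put rough numbers on the rebalancing pain mentioned in the quoted message: when a DataNode is lost, HDFS re-creates the block replicas that lived on it, so the data to copy is roughly whatever that node stored. The back-of-the-envelope sketch below assumes the copy work is spread evenly over the surviving nodes at a fixed per-node recovery bandwidth. The function names and the figures in the example are made up for illustration and are not measurements from this thread.

```python
def rereplication_hours(data_on_failed_node_tb, surviving_nodes,
                        recovery_mb_per_s_per_node):
    """Estimate hours to restore full replication after losing one node.

    Simplifying assumptions: work is spread evenly across the surviving
    nodes, and each one devotes a constant bandwidth to recovery.
    """
    total_mb = data_on_failed_node_tb * 1_000_000  # TB -> MB, decimal units
    aggregate_mb_per_s = surviving_nodes * recovery_mb_per_s_per_node
    return total_mb / aggregate_mb_per_s / 3600


def raw_capacity_tb(logical_tb, replication=3):
    """Raw disk needed to hold logical_tb of data at a given replication."""
    return logical_tb * replication


if __name__ == "__main__":
    # Hypothetical: 8 TB on the failed node, 3 survivors, 20 MB/s each.
    print(f"{rereplication_hours(8, 3, 20):.1f} hours to re-replicate")
    print(f"{raw_capacity_tb(10)} TB raw for 10 TB of data at 3x")
```

With those made-up inputs the estimate comes to about 37 hours, during which the recovery traffic competes with job I/O on the same spindles.
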