Re: Hadoop Distributed Virtualisation

2008-06-09 Thread Steve Loughran

Colin Freas wrote:

I've wondered about this using single or dual quad-core machines with one
spindle per core, and partitioning them out into 2, 4, 8, whatever virtual
machines, possibly marking each physical box as a rack.


Or just host VM images with multi-CPU support and make it one 'machine' 
with the appropriate number of task trackers.




There would be some initial and ongoing sysadmin costs.  But could this
increase throughput on a small cluster, of 2 or 3 boxes with 16 or 24 cores,
with many jobs by limiting the number of cores each job runs on, to say 8?
Has anyone tried such a setup?



I can see reasons for virtualisation (better isolation and security) but 
I don't think performance would improve. More likely anything 
clock-sensitive will get confused if, under load, some VMs get less CPU 
time than they expect.


A better approach could be to improve the schedulers in Hadoop to put 
work where it is best, and maybe even move it if things are taking too 
long.
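The "mark each physical box as a rack" idea upthread maps onto Hadoop's rack-awareness hook: a script named by the topology.script.file.name property, which Hadoop calls with node addresses and which prints one rack path per argument. A minimal sketch, under the assumption that all VMs on one physical host share a 10.0.N.* subnet (the subnets and rack names here are made up for illustration):

```shell
#!/bin/sh
# Hypothetical rack-awareness script for topology.script.file.name.
# Hadoop invokes it with one or more datanode IPs/hostnames and reads
# back one rack path per argument.
# Assumption: VMs on the same physical box share a 10.0.N.* subnet.

rack_for() {
  case "$1" in
    10.0.1.*) echo /physical-box-1 ;;
    10.0.2.*) echo /physical-box-2 ;;
    *)        echo /default-rack ;;
  esac
}

# Emit one rack path per node argument, in order.
for node in "$@"; do
  rack_for "$node"
done
```

With something like this in place, the namenode treats all the VMs on one box as a single "rack", so block replicas get spread across physical machines rather than just across VMs on the same spindles.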


Re: Hadoop Distributed Virtualisation

2008-06-06 Thread Brad C
Hi Colin,

I think this would work as a lab setup for testing how hadoop handles
hardware failures but I was thinking of the opposite on a much larger
scale.

It is thought by many IT people that it's smart to take one high-end
system and split it into multiple machines, though few people think
of taking multiple machines, integrating them into one, and then
virtualising them apart again. This has been done, though the end data
storage has almost always resided on a SAN, which has a cost
implication. I know Hadoop's Distributed Filesystem is geared towards
batch processing and larger data sets with higher latency, but has
anyone ever tried integrating it with a virtual server technology like KVM?
The other possibility is that I'm nuts and have completely lost the plot.

Kind Regards

Brad

On Fri, Jun 6, 2008 at 5:03 PM, Colin Freas [EMAIL PROTECTED] wrote:
 I've wondered about this using single or dual quad-core machines with one
 spindle per core, and partitioning them out into 2, 4, 8, whatever virtual
 machines, possibly marking each physical box as a rack.

 There would be some initial and ongoing sysadmin costs.  But could this
 increase throughput on a small cluster, of 2 or 3 boxes with 16 or 24 cores,
 with many jobs by limiting the number of cores each job runs on, to say 8?
 Has anyone tried such a setup?


 On Fri, Jun 6, 2008 at 10:30 AM, Brad C [EMAIL PROTECTED] wrote:

 Hello Everyone,

 I've been brainstorming recently and it's always been in the back of my
 mind: Hadoop offers the functionality of clustering commodity systems
 together, but how would one go about virtualising them apart again?

 Kind Regards

 Brad :)




Re: Hadoop Distributed Virtualisation

2008-06-06 Thread Edward Capriolo
I once asked a wise man in charge of a rather large multi-datacenter
service, "Have you ever considered virtualization?" He replied, "All
the CPUs here are pegged at 100%."

There may be applications for this type of processing. I have thought
about systems like this from time to time. The thinking goes in
circles: Hadoop is designed for storing and processing on different
hardware, while virtualization lets you split a system into sub-systems.

Virtualization is great for proof of concept.
For example, I have deployed this: I installed VMware with two Linux
systems on my Windows host and followed a Hadoop multi-system tutorial
running on two VMware nodes. I was able to get the word count
application working, and I confirmed that blocks were indeed being
stored on both virtual systems and that processing was being shared
via MapReduce.
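For a two-VM cluster like the one described, the wiring is mostly just the conf files listing which nodes run what; a minimal sketch (the hostnames vm-master and vm-slave are assumptions, not from the original setup):

```
# conf/masters -- host for the secondary namenode (hypothetical name)
vm-master

# conf/slaves -- hosts that run a datanode and tasktracker
vm-master
vm-slave
```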

The processing, however, was slow; of course this is the fault of
VMware, which has a very high emulation overhead. Xen has less
overhead. Linux-VServer and OpenVZ use OS-level virtualization, so
they have very little (almost no) overhead. But regardless of how
much, overhead is overhead. Personally I find that VMware falls
short of its promises.


Re: Hadoop Distributed Virtualisation

2008-06-06 Thread Colin Freas
The MR jobs I'm performing are not CPU-intensive, so I've always assumed
that they're more I/O-bound.  Maybe that's an exceptional situation, but I'm
not really sure.

A good motherboard with a local I/O channel per disk, feeding individual
cores, with memory partitioned up between them...  and I've heard good
things about Intel's next "tock" vis-à-vis internal system throughput.

And yes, this would be a task for a paravirtualization system like Xen.
Again, it's just a thought, but with low-end quad-core procs running about
$300, and the potential to cut the number of machines you need to physically
set up by 75%, I'm not sure I'd say it'd only be good for a proof of
concept.

Also, I just set up a dozen-odd boxes that are two generations behind modern
boxes, and promptly blew a fuse.  The TDP on the Xeon 3.06GHz chips I'm
using is 89W.  The TDP on an Intel Q6600 is 65W, and it covers 4 cores.

It's a simple experiment, but I don't have the resources on hand to run it.
I'm curious if anyone has seen the performance impact from the different
setups we're talking about.  I also think you could come close to faking it
with Hadoop config changes.
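Faking the core-partitioning with config alone would roughly amount to capping the task slots per node in hadoop-site.xml; a sketch, with slot counts picked for an illustrative 8-core box (the values are assumptions, not recommendations):

```xml
<!-- hadoop-site.xml fragment: cap concurrent tasks per tasktracker -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>  <!-- at most 8 concurrent map tasks on this node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>  <!-- and at most 4 concurrent reduce tasks -->
</property>
```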

-Colin


On Fri, Jun 6, 2008 at 12:41 PM, Edward Capriolo [EMAIL PROTECTED]
wrote:

 I once asked a wise man in charge of a rather large multi-datacenter
 service, "Have you ever considered virtualization?" He replied, "All
 the CPUs here are pegged at 100%."

 There may be applications for this type of processing. I have thought
 about systems like this from time to time. The thinking goes in
 circles: Hadoop is designed for storing and processing on different
 hardware, while virtualization lets you split a system into sub-systems.

 Virtualization is great for proof of concept.
 For example, I have deployed this: I installed VMware with two Linux
 systems on my Windows host and followed a Hadoop multi-system tutorial
 running on two VMware nodes. I was able to get the word count
 application working, and I confirmed that blocks were indeed being
 stored on both virtual systems and that processing was being shared
 via MapReduce.

 The processing, however, was slow; of course this is the fault of
 VMware, which has a very high emulation overhead. Xen has less
 overhead. Linux-VServer and OpenVZ use OS-level virtualization, so
 they have very little (almost no) overhead. But regardless of how
 much, overhead is overhead. Personally I find that VMware falls
 short of its promises.