Re: Hadoop Distributed Virtualisation
Colin Freas wrote: I've wondered about this using single or dual quad-core machines with one spindle per core, and partitioning them out into 2, 4, 8, whatever virtual machines, possibly marking each physical box as a "rack". Or just host VM images with multi-CPU support and make it one 'machine' with the appropriate number of task trackers. There would be some initial and ongoing sysadmin costs. But could this increase throughput on a small cluster, of 2 or 3 boxes with 16 or 24 cores, with many jobs by limiting the number of cores each job runs on, to say 8? Has anyone tried such a setup? I can see reasons for virtualisation (better isolation and security) but don't think performance would improve. More likely anything clock-sensitive will get confused if, under load, some VMs get less CPU time than they expect. A better approach could be to improve the schedulers in Hadoop to put work where it is best, maybe even move it if things are taking too long, etc.
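Marking each physical box as a "rack" can be done with Hadoop's rack-awareness hook: the topology.script.file.name property points at a script that is handed a list of node names and must print one rack path per node. A minimal sketch, assuming (hypothetically) that VM hostnames embed the physical box name, e.g. vm3.box2.local:

```shell
#!/bin/sh
# Hypothetical rack-mapping script for Hadoop's topology.script.file.name.
# Hadoop invokes it with node names/IPs as arguments and reads one rack
# path per line, so all VMs on one physical box land in the same "rack".
rack_for() {
  case "$1" in
    *.box1.*) echo "/physical-box1" ;;
    *.box2.*) echo "/physical-box2" ;;
    *)        echo "/default-rack"  ;;   # unknown nodes fall back here
  esac
}

for node in "$@"; do
  rack_for "$node"
done
```

The hostname scheme is an assumption for illustration; any mapping (a lookup table, an IP-prefix match) works as long as co-hosted VMs resolve to the same rack path, so HDFS avoids putting all replicas of a block on one physical machine.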
RE: Hadoop Distributed Virtualisation
I see the VM approach as great for isolation, for customized Hadoop or tools required by the jobs, and for ease of IT management. There is a performance hit on CPU and IO, but I never looked at the numbers. Has anybody? Basically for now, on EC2 for instance, if you need to go faster, just buy 50 more machines for a couple of hours... ;)
Re: Hadoop Distributed Virtualisation
The MR jobs I'm performing are not CPU intensive, so I've always assumed that they're more IO bound. Maybe that's an exceptional situation, but I'm not really sure. A good motherboard with a local IO channel per disk, feeding individual cores, with memory partitioned up between them... and I've heard good things about Intel's next tock vis-a-vis internal system throughput. And yes, this would be a task for a paravirtualization system like Xen. Again, it's just a thought, but with low-end quad-core procs running about $300, and the potential to cut the number of machines you need to physically set up by 75%, I'm not sure I'd say it'd only be good for a proof of concept. Also, I just set up a dozen-odd boxes that are two generations behind modern boxes, and promptly blew a fuse. The TDP on the Xeon 3.06GHz chips I'm using is 89W. The TDP on an Intel Q6600 is 65W, and it represents 4 cores. It's a simple experiment, but I don't have the resources on hand to run it. I'm curious if anyone has seen the performance impact from the different setups we're talking about. I also think you could come close to faking it with Hadoop config changes. -Colin On Fri, Jun 6, 2008 at 12:41 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote: > I once asked a wise man in charge of a rather large multi-datacenter > service, "Have you ever considered virtualization?" He replied, "All > the CPUs here are pegged at 100%." > > There may be applications for this type of processing. I have thought > about systems like this from time to time. This thinking goes in > circles. Hadoop is designed for storing and processing on different > hardware. Virtualization lets you split a system into sub-systems. > > Virtualization is great for proof of concept. > For example, I have deployed this: I installed VMware with two Linux > systems on my Windows host, and I followed a Hadoop multi-system tutorial > running on two VMware nodes. I was able to get the word count > application working. I also confirmed that blocks were indeed being > stored on both virtual systems and that processing was being shared > via map/reduce. > > The processing, however, was slow; of course this is the fault of > VMware. VMware has a very high emulation overhead. Xen has less > overhead. Linux-VServer and OpenVZ use OS-level virtualization (they > have very little (almost no) overhead). Regardless of how much, > overhead is overhead. Personally I find that VMware falls > short of its promises. >
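Colin's idea of faking it with Hadoop config changes can be sketched as a per-TaskTracker slot cap in hadoop-site.xml; the two properties below are the standard slot limits of that era, and the values are illustrative:

```xml
<!-- hadoop-site.xml: limit how many tasks one TaskTracker runs at once -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <!-- e.g. cap a 16-core box at 8 concurrent map tasks -->
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```

This bounds concurrency per node rather than truly pinning a job to cores, but on a small cluster it approximates "limiting the number of cores each job runs on" without any virtualization layer.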
Re: Hadoop Distributed Virtualisation
There is a place for virtualization: if you can justify the overhead for hands-free network management; if your systems really have enough power to run Hadoop and something else; or if you need to run multiple truly isolated versions of Hadoop (selling some type of Hadoop grid service?). If you are working on the type of application that requires a system like Hadoop, the implication is that you have a large network of physical hardware and you need all the power you can get. It is counterintuitive to start splitting hosts up into virtual machines; however, every application is different, and it may make sense.
Re: Hadoop Distributed Virtualisation
I once asked a wise man in charge of a rather large multi-datacenter service, "Have you ever considered virtualization?" He replied, "All the CPUs here are pegged at 100%." There may be applications for this type of processing. I have thought about systems like this from time to time. This thinking goes in circles. Hadoop is designed for storing and processing on different hardware. Virtualization lets you split a system into sub-systems. Virtualization is great for proof of concept. For example, I have deployed this: I installed VMware with two Linux systems on my Windows host, and I followed a Hadoop multi-system tutorial running on two VMware nodes. I was able to get the word count application working. I also confirmed that blocks were indeed being stored on both virtual systems and that processing was being shared via map/reduce. The processing, however, was slow; of course this is the fault of VMware. VMware has a very high emulation overhead. Xen has less overhead. Linux-VServer and OpenVZ use OS-level virtualization (they have very little (almost no) overhead). Regardless of how much, overhead is overhead. Personally I find that VMware falls short of its promises.
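The two-node smoke test described above boils down to a handful of stock commands (a sketch assuming a Hadoop 0.17-era layout, run from HADOOP_HOME on the master VM; the "gutenberg" input directory is just an example name):

```shell
bin/hadoop namenode -format           # initialise HDFS (destroys existing data)
bin/start-all.sh                      # start NameNode, DataNodes, JobTracker, TaskTrackers
bin/hadoop dfs -put gutenberg input   # copy sample text files into HDFS
bin/hadoop jar hadoop-*-examples.jar wordcount input output
bin/hadoop dfs -cat output/part-00000 # inspect the word counts
bin/hadoop fsck / -files -blocks -locations  # confirm blocks landed on both VMs
```

The fsck listing is what shows replicas spread across both virtual DataNodes, confirming that storage as well as map/reduce work is actually being shared.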
Re: Hadoop Distributed Virtualisation
Hi Colin, I think this would work as a lab setup for testing how Hadoop handles hardware failures, but I was thinking of the opposite on a much larger scale. It is thought by many IT individuals that it's smart to take one high-end system and split it into multiple machines, though few think of taking multiple machines, integrating them into one, and then virtualising them apart again. This has been done, though the data storage has almost always resided on a SAN, which has a cost implication. I know Hadoop's Distributed Filesystem is geared towards batch processing and larger data sets with higher latency, but has anyone ever tried integrating it with a virtual server technology like KVM? The other possibility is that I'm nuts and have completely lost the plot. Kind Regards Brad On Fri, Jun 6, 2008 at 5:03 PM, Colin Freas <[EMAIL PROTECTED]> wrote: > I've wondered about this using single or dual quad-core machines with one > spindle per core, and partitioning them out into 2, 4, 8, whatever virtual > machines, possibly marking each physical box as a "rack". > > There would be some initial and ongoing sysadmin costs. But could this > increase throughput on a small cluster, of 2 or 3 boxes with 16 or 24 cores, > with many jobs by limiting the number of cores each job runs on, to say 8? > Has anyone tried such a setup? > > > On Fri, Jun 6, 2008 at 10:30 AM, Brad C <[EMAIL PROTECTED]> wrote: > >> Hello Everyone, >> >> I've been brainstorming recently and it's always been in the back of my >> mind: Hadoop offers the functionality of clustering commodity systems >> together, but how would one go about virtualising them apart again? >> >> Kind Regards >> >> Brad :) >> >
Re: Hadoop Distributed Virtualisation
I've wondered about this using single or dual quad-core machines with one spindle per core, and partitioning them out into 2, 4, 8, whatever virtual machines, possibly marking each physical box as a "rack". There would be some initial and ongoing sysadmin costs. But could this increase throughput on a small cluster, of 2 or 3 boxes with 16 or 24 cores, with many jobs by limiting the number of cores each job runs on, to say 8? Has anyone tried such a setup? On Fri, Jun 6, 2008 at 10:30 AM, Brad C <[EMAIL PROTECTED]> wrote: > Hello Everyone, > > I've been brainstorming recently and it's always been in the back of my > mind: Hadoop offers the functionality of clustering commodity systems > together, but how would one go about virtualising them apart again? > > Kind Regards > > Brad :) >
Hadoop Distributed Virtualisation
Hello Everyone, I've been brainstorming recently and it's always been in the back of my mind: Hadoop offers the functionality of clustering commodity systems together, but how would one go about virtualising them apart again? Kind Regards Brad :)