Is there something else I could read about setting up short-lived Hadoop clusters on virtual machines? I have no experience with VMs at all. I see there is quite a bit of material about using them to get Hadoop up and running as a pseudo-cluster on a single machine, but I don't follow how this extends to multiple machines allocated by Torque.
Thanks,
Dave

On Tue, Jun 15, 2010 at 3:49 AM, Steve Loughran <ste...@apache.org> wrote:

> Edward Capriolo wrote:
>
>> I have not used it much, but I think HOD is pretty cool. I guess most people
>> who are looking to (spin up, run job ,transfer off, spin down) are using
>> EC2. HOD does something like make private hadoop clouds on your hardware and
>> many probably do not have that use case. As schedulers advance and get
>> better HOD becomes less attractive, but I can always see a place for it.
>
> I don't know who is using it, or maintaining it; we've been bringing up
> short-lived Hadoop clusters different.
>
> I think I should write a little article on the topic; I presented about it
> at Berlin Buzzwords last week.
>
> Short lived Hadoop clusters on VMs are fine if you don't have enough data or
> CPU load to justify a set of dedicated physical machines, and is a good way
> of experimenting with Hadoop at scale. You can maybe lock down the network
> better too, though that depends on your VM infrastructure.
>
> Where VMs are weak is in disk IO performance, but there's no reason why the
> VM infrastructure can't take a list of filenames/directories as a hint for
> VM placement (placement is the new scheduling, incidentally), and
> virtualized IO can only improve. If you can run Hadoop MapReduce directly
> against SAN-mounted storage then you can stop worrying about locality of
> data and still gain from parallelisation of the operations.
>
> -steve