Is there something else I could read about setting up short-lived
Hadoop clusters on virtual machines? I have no experience with VMs at
all. I see there is quite a bit of material about using them to get
Hadoop up and running with a pseudo-cluster on a single machine, but I
don't follow how this stretches out to using multiple machines
allocated by Torque.

Thanks,
Dave

On Tue, Jun 15, 2010 at 3:49 AM, Steve Loughran <ste...@apache.org> wrote:
> Edward Capriolo wrote:
>
>>
>> I have not used it much, but I think HOD is pretty cool. I guess most people
>> who are looking to (spin up, run job, transfer off, spin down) are using
>> EC2. HOD does something like make private Hadoop clouds on your hardware, and
>> many probably do not have that use case. As schedulers advance and get
>> better, HOD becomes less attractive, but I can always see a place for it.
>
> I don't know who is using it, or maintaining it; we've been bringing up
> short-lived Hadoop clusters differently.
>
> I think I should write a little article on the topic; I presented about it
> at Berlin Buzzwords last week.
>
> Short-lived Hadoop clusters on VMs are fine if you don't have enough data or
> CPU load to justify a set of dedicated physical machines, and are a good way
> of experimenting with Hadoop at scale. You can maybe lock down the network
> better too, though that depends on your VM infrastructure.
>
> Where VMs are weak is in disk IO performance, but there's no reason why the
> VM infrastructure can't take a list of filenames/directories as a hint for
> VM placement (placement is the new scheduling, incidentally), and
> virtualized IO can only improve. If you can run Hadoop MapReduce directly
> against SAN-mounted storage then you can stop worrying about locality of
> data and still gain from parallelisation of the operations.
>
>
> -steve
>
>
>
