Colin Freas wrote:
I've wondered about this using single or dual quad-core machines with one spindle per core, and partitioning them out into 2, 4, 8, whatever virtual machines, possibly marking each physical box as a "rack".
Or just host VM images with multi-CPU support and make each one 'machine' with the appropriate number of task trackers.
There would be some initial and ongoing sysadmin costs. But could this increase throughput on a small cluster (2 or 3 boxes with 16 or 24 cores) running many jobs, by limiting the number of cores each job runs on to, say, 8? Has anyone tried such a setup?
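(For context, the per-node parallelism being described here can also be capped without VMs, via the task tracker slot settings in mapred-site.xml. The values below are purely illustrative, assuming a Hadoop 0.x/1.x-era install:)

```xml
<!-- mapred-site.xml: cap concurrent tasks per TaskTracker.
     Example values for an 8-core node; tune to your workload. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```

Note this caps slots per node, not per job, so it limits how hard any one box is driven rather than fencing jobs off from each other the way separate VMs would.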
I can see reasons for virtualisation (better isolation and security) but don't think performance would improve. More likely anything clock-sensitive will get confused if, under load, some VMs get less CPU time than they expect.
A better approach could be to improve the schedulers in Hadoop to put work where it is best, and maybe even move it if things are taking too long, etc.
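(Worth noting: Hadoop's speculative execution already covers part of the "move it if things are taking too long" idea, by launching duplicate attempts of slow tasks on other nodes. It is on by default and controlled in mapred-site.xml; property names below are from the Hadoop 0.x/1.x configuration:)

```xml
<!-- mapred-site.xml: speculative execution re-runs straggler
     tasks elsewhere and keeps whichever attempt finishes first. -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>
```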