Hi,
I'm running Hadoop 0.19.1 on 19 nodes. I've been benchmarking a Hadoop
workload with 115 Map tasks, on two different distributed filesystems
(KFS and PVFS); in some tests, I also have a write-intensive non-Hadoop
job running in the background (an HPC checkpointing benchmark). I've
found that Hadoop sometimes makes most of the Map tasks data-local and
sometimes makes none of them data-local; which outcome I get depends on
both the filesystem I use and whether the background job is running.
(I never run multiple Hadoop jobs concurrently in these tests.)
I'd like to learn how the Hadoop scheduler places Map tasks, and how
locality is taken into account, so I can figure out why this is
happening. (I'm using the default FIFO scheduler.) Is there some
documentation available that would explain this?
Thanks!
- Locality when placing Map tasks Esteban Molina-Estolano