On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:
> Briefly going through the DistributedCache information, it seems to be a way
> to distribute files to mappers/reducers.
Sure, but it handles the distribution problem for you.
> One still needs to read the contents into each map/reduce task VM.
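If the data is plain binary, you could just mmap it from the various
tasks. That would be pretty efficient: tasks on the same node share the
mapped pages through the OS page cache instead of each holding a
private copy on its heap.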
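
For illustration, here is a minimal sketch of what mapping a localized file read-only could look like in Java. The class name, the helper name, and the temp file standing in for a cached file are all hypothetical, and it uses the modern java.nio.file API rather than anything Hadoop-specific. Note that a single mapping is limited to 2 GB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSideData {
    // Hypothetical helper: maps the whole file read-only. The mapping is
    // backed by the OS page cache, so several JVMs mapping the same file
    // on one node can share the same physical pages.
    public static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for a file localized on the task node.
        Path file = Files.createTempFile("side-data", ".bin");
        Files.write(file, new byte[] {0, 0, 0, 42, 0, 0, 0, 7});

        MappedByteBuffer buf = mapReadOnly(file);
        // ByteBuffer is big-endian by default, so these read 42 and 7.
        System.out.println(buf.getInt(0));
        System.out.println(buf.getInt(4));
    }
}
```

In a real task the path would be wherever the cached file ends up on local disk, and the lookup logic would depend on the binary layout of the data.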
The other direction is to use the MultithreadedMapRunner and run
multiple maps as threads in the same VM. But unless your maps are CPU
heavy or contacting external servers, it probably won't help as much
as you'd like.
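
To make the threading idea concrete, here is a rough sketch in plain java.util.concurrent. It is not Hadoop's runner or its API. The class name, the mapAll helper, and the lookup-style "map function" are assumptions made up for the example. The point is only that threads in one VM can read a single in-memory copy of the side data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedSideDataThreads {
    // Hypothetical stand-in for a map function: looks each record up in
    // one shared, read-only table and returns -1 for unknown keys.
    public static List<Integer> mapAll(List<String> records,
                                       Map<String, Integer> sideData,
                                       int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String record : records) {
                // Every task reads the same map; nothing is copied per thread.
                pending.add(pool.submit(() -> sideData.getOrDefault(record, -1)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pending) {
                results.add(f.get()); // collected in input order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> sideData = Map.of("a", 1, "b", 2);
        System.out.println(mapAll(List.of("a", "b", "c"), sideData, 4));
    }
}
```

The shared table has to be immutable or otherwise thread-safe, and the map logic itself must be thread-safe too.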
-- Owen