On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:

> Briefly going through the DistributedCache information, it seems to be a way
> to distribute files to mappers/reducers.

Sure, but it handles the distribution problem for you.

> One still needs to read the contents into each map/reduce task VM.

If the data is plain binary, each task could just mmap the file. That would be pretty efficient, since tasks on the same node share the OS page cache for the file instead of each holding a private copy.
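A minimal sketch of the mmap approach in plain Java NIO, assuming the side file is already on local disk (for example, placed there by DistributedCache) and holds fixed-width big-endian 32-bit records. The class name `MappedSideData`, its methods, and the record layout are hypothetical and only for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: maps a local side file read-only so a task can read
// records directly from the mapping without first copying the file onto the heap.
final class MappedSideData {

    /**
     * Maps the whole file read-only. A single mapping is limited to
     * Integer.MAX_VALUE bytes; larger files would need several mappings.
     */
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    /** Reads the 32-bit record at the given index (assumed layout: big-endian ints). */
    static int readIntAt(MappedByteBuffer buffer, int recordIndex) {
        // Absolute get: does not move the buffer position, so concurrent readers are safe.
        return buffer.getInt(recordIndex * Integer.BYTES);
    }
}
```

Each task would call `mapReadOnly` once during setup and then use `readIntAt` for lookups. Pages are faulted in on demand, so a task only pays for the parts of the file it actually touches.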

The other direction is to use MultithreadedMapRunner and run multiple maps as threads in the same VM. But unless your maps are CPU-heavy or contacting external servers, it probably won't help as much as you'd like.
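To illustrate why threads in one VM avoid the per-task copy, here is a plain-Java sketch of several worker threads reading one shared, read-only table. This is not Hadoop's implementation of the multithreaded runner; the class `SharedLookupDemo`, the method `lookupAll`, and the `"?"` default for missing keys are hypothetical names and choices made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo: many worker threads share a single in-memory table.
final class SharedLookupDemo {

    /** Looks up every key against one shared table using a fixed-size thread pool. */
    static List<String> lookupAll(Map<String, String> table, List<String> keys, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String key : keys) {
                // Every task reads the same table instance; nothing is copied per thread.
                // The table must not be modified while the tasks are running.
                Callable<String> task = () -> table.getOrDefault(key, "?");
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // collected in submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

The table is loaded once per VM rather than once per task, which is the memory saving. Throughput only improves if the per-record work can actually overlap, for example when threads spend time waiting on an external server.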

-- Owen
