On Fri, Aug 21, 2009 at 12:11 PM, bharath vissapragada <
bharathvissapragada1...@gmail.com> wrote:

> Yes, my doubt is how the location of the reducer is selected. Is it
> selected arbitrarily, or is it placed on a particular machine that
> already holds more of the values (corresponding to the key of that
> reducer), which would reduce the cost of transferring data across the
> network (because many values for that key are already on the machine
> where the map phase completed)?

I think you're asking whether a ReduceTask is scheduled on the node that
holds the largest of the map-output partitions (p1-pN) that the ReduceTask
has to fetch in order to do its job. The answer is "no": ReduceTasks are
assigned arbitrarily. No such optimization is done, and I think it would
only really be an optimization if one of your partitions is heavily skewed
for some reason. Also, as Amogh pointed out, ReduceTasks start fetching
their map-output partitions (the shuffle phase) as soon as they hear about
completed map tasks, so a ReduceTask is typically already running before
the sizes of all its partitions are known. It would therefore not be
possible to schedule ReduceTasks only on nodes with the largest partitions.
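
To make the skew point concrete, here is a rough sketch (plain Python, not
Hadoop code) of how much of one ReduceTask's input would already be local if
it were placed on the node holding its largest partition. The node names,
the partition sizes, and the helper functions are all made up for
illustration; nothing here reflects what the scheduler actually does.

```python
def local_fraction(sizes_by_node, node):
    """Fraction of a reducer's total input that already sits on `node`."""
    total = sum(sizes_by_node.values())
    return sizes_by_node.get(node, 0) / total if total else 0.0


def best_node(sizes_by_node):
    """Node holding the largest share of this reducer's input."""
    return max(sizes_by_node, key=sizes_by_node.get)


# Hypothetical sizes (in MB) of ONE reducer's map-output partition on each
# node that ran map tasks. These numbers are assumptions, not measurements.
balanced = {"node1": 100, "node2": 100, "node3": 100, "node4": 100}
skewed = {"node1": 370, "node2": 10, "node3": 10, "node4": 10}

# Share of the shuffle traffic avoided by the best possible placement.
balanced_saving = local_fraction(balanced, best_node(balanced))
skewed_saving = local_fraction(skewed, best_node(skewed))

print(f"balanced: {balanced_saving:.1%} of the input is local at best")
print(f"skewed:   {skewed_saving:.1%} of the input is local at best")
```

With evenly spread map output, even the best placement keeps only about 1/N
of the data local on an N-node cluster, so most of it crosses the network
either way. Only the skewed case shows a large saving, and computing it
requires the sizes of all partitions, which are not known until every map
task has finished.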

Harish Mallipeddi
