On Fri, Aug 21, 2009 at 12:11 PM, bharath vissapragada <
bharathvissapragada1...@gmail.com> wrote:

> Yes , My doubt is that how is the location of the reducer selected . Is it
> selected arbitrarily or is selected on a particular machine which has
> already the more values (corresponding to the key of that reducer) which
> reduces the cost of transferring data across the network(because already
> many values to that key are on that machine where the map phase
> completed)..
>

I think you're asking whether a ReduceTask is scheduled on the node that
holds the largest of the map-output partitions (p1-pN) it has to fetch in
order to do its job. The answer is "no": ReduceTasks are assigned
arbitrarily. No such optimization is done, and I think it would only really
pay off if one of your partitions is heavily skewed for some reason. Also,
as Amogh pointed out, the ReduceTasks start fetching their map-output
partitions (the shuffle phase) as soon as they hear about completed map
tasks, i.e. before all the maps have finished. The partition sizes are
therefore not all known when the ReduceTasks are scheduled, so it would not
be possible to place ReduceTasks only on nodes with the largest partitions.
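
To make the skew point concrete, here is a minimal sketch in Python. It is
not Hadoop code: the node names, the byte counts and the helper functions
shuffle_cost and best_node are all made up for illustration. It compares
how much data one reducer pulls over the network when it is placed on an
arbitrary node versus on the node holding its largest partition.

```python
# Hypothetical illustration only, not Hadoop code.
# Node names and partition sizes are invented for this example.

def shuffle_cost(partition_bytes, reducer_node):
    """Bytes the reducer must fetch over the network when run on reducer_node.

    partition_bytes maps each node to the size of this reducer's map-output
    partition stored on that node. The partition that already sits on the
    reducer's own node needs no network transfer.
    """
    return sum(size for node, size in partition_bytes.items()
               if node != reducer_node)


def best_node(partition_bytes):
    """Node holding the largest partition (knowable only after all maps finish)."""
    return max(partition_bytes, key=partition_bytes.get)


# Roughly even partitions: placement barely matters.
even = {"node1": 100, "node2": 110, "node3": 90}
# One heavily skewed partition: placement matters a lot.
skewed = {"node1": 10, "node2": 900, "node3": 20}

for name, parts in (("even", even), ("skewed", skewed)):
    arbitrary = shuffle_cost(parts, "node1")  # stand-in for an arbitrary choice
    optimal = shuffle_cost(parts, best_node(parts))
    print(name, "arbitrary:", arbitrary, "optimal:", optimal)
```

With these made-up numbers the even case goes from 200 to 190 bytes, while
the skewed case goes from 920 to 30. Note that best_node needs the full
table of partition sizes, which is exactly what is not available while the
maps are still running.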

-- 
Harish Mallipeddi
http://blog.poundbang.in
