On Fri, Aug 21, 2009 at 12:11 PM, bharath vissapragada <bharathvissapragada1...@gmail.com> wrote:
> Yes, my doubt is how the location of the reducer is selected. Is it
> selected arbitrarily, or is it placed on a particular machine that
> already holds more of the values (corresponding to the key of that reducer),
> which reduces the cost of transferring data across the network (because
> many values for that key are already on the machine where the map phase
> completed)?
>

I think what you're asking is whether a ReduceTask is scheduled on the node that holds the largest partition among all the map-output partitions (p1-pN) that the ReduceTask has to fetch in order to do its job. The answer is "no": ReduceTasks are assigned arbitrarily. No such optimization is done, and I think it would really be an optimization only if one of your partitions is heavily skewed for some reason.

Also, as Amogh pointed out, the ReduceTasks start fetching their map-output partitions (the shuffle phase) as and when they hear about completed ones. By then the ReduceTasks have already been placed, while the sizes of the partitions from maps that are still running are not yet known. So it would not be possible to schedule ReduceTasks only on nodes with the largest partitions.

-- 
Harish Mallipeddi
http://blog.poundbang.in
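
To put rough numbers on the skew point, here is a small back-of-the-envelope sketch in Python. It is not Hadoop code and uses no Hadoop API. The node names and partition sizes are made up, and it assumes, purely for illustration, that an "arbitrary" assignment lands the ReduceTask on any one of the map nodes with equal probability. It compares the expected network transfer under that assumption with the transfer when the ReduceTask sits on the node holding its largest partition.

```python
def remote_bytes(partition_bytes_by_node, reducer_node):
    """Bytes the reducer must fetch over the network when it runs on reducer_node."""
    return sum(size for node, size in partition_bytes_by_node.items()
               if node != reducer_node)


def locality_gain(partition_bytes_by_node):
    """Fraction of the reducer's total input that locality-aware placement would
    save compared with uniformly random placement (an assumption, see above)."""
    sizes = list(partition_bytes_by_node.values())
    total = sum(sizes)
    if total == 0:
        return 0.0
    # Expected remote traffic when the reducer lands on a random map node.
    expected_random = sum(remote_bytes(partition_bytes_by_node, node)
                          for node in partition_bytes_by_node) / len(sizes)
    # Remote traffic when the reducer sits next to its largest partition.
    best = total - max(sizes)
    return (expected_random - best) / total


# Hypothetical sizes of one reducer's partition on each map node.
balanced = {"node1": 100, "node2": 100, "node3": 100, "node4": 100}
skewed = {"node1": 700, "node2": 100, "node3": 100, "node4": 100}

print(locality_gain(balanced))
print(locality_gain(skewed))
```

With evenly sized partitions the gain is zero, since every node is as good as any other. Only the skewed case shows a saving, which matches the remark above that this is an optimization only when a partition is heavily skewed.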