Say that you have a taskSet of maps, each operating on one Hadoop partition. How does the scheduler decide which mapTask output (i.e., a shuffle block) goes to which reducer? Are the shuffle blocks split evenly among the reducers?

On Sun, Nov 10, 2013 at 9:50 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> It is responsible for a subset of shuffle blocks. MapTasks split up their
> data, creating one shuffle block for every reducer. During the shuffle
> phase, the reducer will fetch all shuffle blocks that were intended for it.
>
> On Sun, Nov 10, 2013 at 9:38 PM, Umar Javed <umarj.ja...@gmail.com> wrote:
>
>> I was wondering how does the scheduler assign the ShuffledRDD locations
>> to the reduce tasks? Say that you have 4 reduce tasks, and a number of
>> shuffle blocks across two machines. Is each reduce task responsible for a
>> subset of individual keys or a subset of shuffle blocks?
>>
>> Umar
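
To make the mechanism described above concrete, here is a rough sketch in plain Python (not Spark code). It assumes hash partitioning, i.e. a record's reducer is its key's hash modulo the number of reducers, which is the principle Spark's HashPartitioner works on. The function names and the sample data are made up for illustration, and integer keys are used only so the hashes are predictable.

```python
def reducer_for(key, num_reducers):
    # Assumption: hash partitioning. Python's % is non-negative for a
    # positive modulus, so the result is always a valid reducer index.
    return hash(key) % num_reducers


def map_task_shuffle_blocks(records, num_reducers):
    # One map task splits its partition into one shuffle block per reducer.
    blocks = [[] for _ in range(num_reducers)]
    for key, value in records:
        blocks[reducer_for(key, num_reducers)].append((key, value))
    return blocks


def fetch_for_reducer(all_map_outputs, reducer_id):
    # During the shuffle, a reducer fetches the block intended for it
    # from every map task's output.
    fetched = []
    for blocks in all_map_outputs:
        fetched.extend(blocks[reducer_id])
    return fetched


# Hypothetical input: two map tasks (one per partition), four reducers.
NUM_REDUCERS = 4
partitions = [
    [(0, "a"), (1, "b"), (5, "c")],
    [(4, "d"), (1, "e"), (2, "f")],
]
map_outputs = [map_task_shuffle_blocks(p, NUM_REDUCERS) for p in partitions]
reducer_inputs = [fetch_for_reducer(map_outputs, r) for r in range(NUM_REDUCERS)]

for r, records in enumerate(reducer_inputs):
    print(r, records)
```

In this sketch every reducer fetches the same number of blocks (one per map task), but the blocks are not the same size: reducer 1 receives three records and reducer 3 receives none, because block sizes follow the key distribution. Each reducer also ends up owning every record for the keys that hash to it, so "a subset of shuffle blocks" and "a subset of keys" describe the same assignment under this assumption.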