Hello, Inline.
On Fri, Oct 29, 2010 at 10:30 AM, bharath v <[email protected]> wrote:
> Hi,
>
> After the partitioning phase on each mapper output, on what basis is
> a partition assigned to a reducer? Is this random or does hadoop
> employ some strategy?

Partitioning itself is the strategy that sends the appropriate map output records to the reducers. A partition function, which can be user defined and is HashPartitioner by default, emits for every record a number n such that 0 <= n < (no. of reducers). Usually this is computed as (a hash of the key in the map output) mod (no. of reducers).

Look at JobConf.setPartitionerClass(...) to set your own. And this: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html for the Partitioner interface.

Each reduce 'task' gets an identification number when it is assigned as one (TaskAttemptID.getTaskID().getID() essentially -- the identifier of any task on the whole). Map output files created for reducers carry the partition function's result (which is the ID of the reducer they must go to), and each reducer fetches only those map output partitions that match its own identification number (a defensive check is also done).

> Where can I find the portion of code doing this decision? I tried
> Reducer / ReducerContext classes but couldn't find it.

If you want to see how the partitioner function is applied, have a look at MapTask.OldOutputCollector.collect() or MapTask.NewOutputCollector.write() onwards.

> Also where can I find the reducer code which fetches the data that
> needs to be reduced by it ..

If you want to see how the reduce task fetches and utilizes the data, look at the Shuffle part (Shuffle.run()) from ReduceTask.run() onwards.

The classes mentioned mostly reside under the o.a.h.mapred package. Some may lie outside it, but I believe your IDE can find them with a search.

Disclaimer, hah!

--
Harsh J
www.harshj.com
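P.S. For illustration, here is a minimal, dependency-free sketch of the arithmetic the default HashPartitioner performs. The class and method names below are made up for the sketch (the real interface is the Partitioner link above); the masking-and-mod logic mirrors the default behavior:

```java
// Standalone sketch of the default partitioning arithmetic in Hadoop:
// partition = (non-negative hash of the key) mod (number of reduce tasks).
public class PartitionSketch {

    // Mirrors what HashPartitioner's getPartition() does: mask off the sign
    // bit so the hash is non-negative, then mod by the reducer count,
    // guaranteeing 0 <= result < numReduceTasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            int p = getPartition(key, reducers);
            // Every record carrying the same key maps to the same reducer ID;
            // that determinism is what the shuffle relies on.
            System.out.println(key + " -> reducer " + p);
        }
    }
}
```

Note that the same key always yields the same reducer ID, which is the whole contract: all values for a key end up at one reducer.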
