Hello,

Inline.

On Fri, Oct 29, 2010 at 10:30 AM, bharath v
<[email protected]> wrote:
> Hi ,
>
>
> After the partitioning phase on each mapper output , on what basis is
> a partition assigned to a reducer ?  Is this random or does hadoop
> employ some strategy ?

Partitioning itself is the strategy that routes each record of the map
outputs to the appropriate reducer. A partition function, which can be
user defined and is HashPartitioner by default, is supposed to emit,
for every record, a number n such that 0 <= n < (no. of reducers).
Usually this is computed as (a hash of the key received in the map
output) mod (no. of reducers). Look at JobConf.setPartitionerClass(...)
to set your own. And this:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Partitioner.html
for the Partitioner interface.
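As a sketch of that default hash-mod behavior (plain Java with no
Hadoop dependencies -- the real HashPartitioner implements
org.apache.hadoop.mapred.Partitioner's getPartition(key, value,
numPartitions) method; the class name here is mine):

```java
// Sketch of the default HashPartitioner logic, without Hadoop
// on the classpath. Mirrors the usual implementation:
// (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
public class PartitionSketch {

    // Mask off the sign bit so negative hashCodes still yield a
    // non-negative index, then take the remainder modulo the
    // number of reduce tasks. Result is always in [0, numReduceTasks).
    public static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer "
                    + getPartition(key, numReducers));
        }
    }
}
```

Every record with the same key lands on the same reducer, which is
what makes the reduce-side grouping work.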

Each reduce 'task' gets an identification number when it is assigned
(TaskAttemptID.getTaskID().getID() essentially -- the identifier of
any task on the whole). Map output files created for reducers carry
the partition function's result (which is the ID of the reducer each
partition must go to) associated with them, and a reducer fetches only
those map outputs that match its own identification number (a
defensive check is also done).
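As a toy model of that fetch step (plain Java; the class and method
names here are illustrative, not Hadoop's actual internals), picture
map output segments tagged with their destination partition and a
reducer pulling only the ones tagged with its own ID:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the reduce-side fetch: map outputs are tagged with
// the partition (reducer) number they belong to, and a reducer
// pulls only the segments whose tag matches its own task ID.
// Names are mine; this is not Hadoop's real code.
public class FetchSketch {

    // One map output segment, tagged with its destination reducer.
    static class Segment {
        final int partition;  // result of the partition function
        final String data;    // the records in this partition
        Segment(int partition, String data) {
            this.partition = partition;
            this.data = data;
        }
    }

    // What a reducer with the given ID would pull.
    static List<String> fetchFor(int reducerId, List<Segment> mapOutputs) {
        List<String> pulled = new ArrayList<>();
        for (Segment s : mapOutputs) {
            if (s.partition == reducerId) {  // the defensive check
                pulled.add(s.data);
            }
        }
        return pulled;
    }

    public static void main(String[] args) {
        List<Segment> outputs = new ArrayList<>();
        outputs.add(new Segment(0, "a:1"));
        outputs.add(new Segment(1, "b:1"));
        outputs.add(new Segment(0, "c:1"));
        // Reducer 0 pulls only the segments tagged 0.
        System.out.println(fetchFor(0, outputs));  // [a:1, c:1]
    }
}
```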

>
> Where can I find the portion of code doing this decision ? I tried
> Reducer / ReducerContext classes but couldn't find it.

If you want to see how the partitioner function is applied, have a
look at MapTask.OldOutputCollector.collect() or
MapTask.NewOutputCollector.write() onwards.

>
> Also where can I find the reducer code which fetches the data that
> needs to be reduced by it ..

If you want to see how the reduce task fetches and utilizes the data,
look at the shuffle phase (Shuffle.run()) invoked from
ReduceTask.run() onwards.

The classes mentioned mostly reside under the o.a.h.mapred package.
Some may lie outside it, but I believe your IDE can find them with a
search. Disclaimer, hah!

-- 
Harsh J
www.harshj.com
