Re: Basic question on how reducer works

Grandl Robert Mon, 09 Jul 2012 20:16:18 -0700

Thanks Arun.

So, just to clarify: the map creates partitions according to the number of 
reducers, such that each reducer gets roughly the same number of keys in its 
partition. However, each key can have a different number of values, so the 
"weight" of each partition depends on that. Also, when a new <key, value> is 
added to a partition, is a hash on the partition ID computed to find the 
corresponding partition?
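
For reference, a minimal sketch of how a key is mapped to a partition. It uses the same formula as Hadoop's default HashPartitioner, where the hash is taken over the key (not over a partition ID) and the result is the partition number. The class name HashPartitionSketch and the sample keys are made up for illustration, and a job with a custom Partitioner may behave differently.

```java
import java.util.Arrays;
import java.util.List;

class HashPartitionSketch {
    // Same formula as Hadoop's default HashPartitioner: clear the sign bit
    // of the key's hash, then take the remainder by the number of reducers.
    public static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // hypothetical job setting
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "apple");
        for (String key : keys) {
            // Equal keys always land in the same partition.
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
    }
}
```

Because the partition depends only on the key and the reducer count, keys are spread roughly evenly across reducers, but nothing balances the number of values per key.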


Robert



________________________________
 From: Arun C Murthy <a...@hortonworks.com>
To: mapreduce-user@hadoop.apache.org 
Sent: Monday, July 9, 2012 4:33 PM
Subject: Re: Basic question on how reducer works
 



On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:

>Thanks a lot, guys, for the answers.
>
>
>
>Still, I am not able to find the exact code for the following things:
>
>
>1. How a reducer reads only its own partition from a map output. I looked into 
>ReduceTask#getMapOutput, which does the actual read in 
>ReduceTask#shuffleInMemory, but I don't see where it specifies which partition 
>to read (reduceID).
>
>
Look at TaskTracker.MapOutputServlet.
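
A rough sketch of the idea behind that servlet, not the actual TaskTracker code: the fetch request names both a map and a reduce, and the serving side uses a per-map index to find the byte range of that reduce's partition inside the map output file. The names MapOutputIndexSketch, IndexEntry and readPartition are hypothetical, and an in-memory byte array stands in for the on-disk output file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MapOutputIndexSketch {
    // Hypothetical index entry: where one reducer's partition starts in the
    // map output and how many bytes it occupies.
    static final class IndexEntry {
        final int startOffset;
        final int length;

        IndexEntry(int startOffset, int length) {
            this.startOffset = startOffset;
            this.length = length;
        }
    }

    // Return only the bytes that belong to the requested reduce partition.
    public static byte[] readPartition(byte[] mapOutput, IndexEntry[] index, int reduceId) {
        IndexEntry entry = index[reduceId];
        return Arrays.copyOfRange(mapOutput, entry.startOffset, entry.startOffset + entry.length);
    }

    public static void main(String[] args) {
        // One map output holding three partitions laid out back to back.
        byte[] mapOutput = "aaaabbcccccc".getBytes(StandardCharsets.US_ASCII);
        IndexEntry[] index = {
            new IndexEntry(0, 4), new IndexEntry(4, 2), new IndexEntry(6, 6)
        };
        // Reducer 1 receives only its own slice: "bb".
        byte[] slice = readPartition(mapOutput, index, 1);
        System.out.println(new String(slice, StandardCharsets.US_ASCII));
    }
}
```

In this picture the reducer never has to know offsets itself. It only sends its reduce ID, and the side that holds the map output does the lookup.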


>2. I still don't understand very well in which part of the code (MapTask.java) 
>the intermediate data is written to which partition. So MapOutputBuffer is the 
>one that actually writes the data to the buffer and spills once the buffer is 
>full. Could you please elaborate a bit on how the data is written to each 
>partition?
>
>
Essentially you can think of the partition-id as the 'primary key' and the 
actual 'key' in the map-output of <key, value> as the 'secondary key'.
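
The 'primary key' / 'secondary key' picture can be sketched as a two-level sort: records are ordered by partition ID first and by key within each partition, so that each partition ends up as one contiguous, key-sorted run. This is only an illustration of that ordering under simplified assumptions; OutputRecord and sortForSpill are hypothetical names, and the real MapOutputBuffer works on serialized bytes rather than Java objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PartitionThenKeySortSketch {
    // Hypothetical stand-in for one buffered map-output record.
    static final class OutputRecord {
        final int partition;
        final String key;
        final String value;

        OutputRecord(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Order by partition ID (the 'primary key'), then by the actual key
    // (the 'secondary key'). The input list is left unchanged.
    public static List<OutputRecord> sortForSpill(List<OutputRecord> buffered) {
        List<OutputRecord> sorted = new ArrayList<>(buffered);
        sorted.sort(Comparator.comparingInt((OutputRecord r) -> r.partition)
                .thenComparing(r -> r.key));
        return sorted;
    }

    public static void main(String[] args) {
        List<OutputRecord> buffered = Arrays.asList(
            new OutputRecord(1, "pear", "1"),
            new OutputRecord(0, "zebra", "1"),
            new OutputRecord(1, "apple", "1"));
        for (OutputRecord r : sortForSpill(buffered)) {
            System.out.println(r.partition + "\t" + r.key + "\t" + r.value);
        }
    }
}
```

With the records in this order, recording where each partition's run begins and how long it is gives enough information to serve one reducer's slice later.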

hth,
Arun


>Thanks,
>Robert
>
>
>
>________________________________
> From: Arun C Murthy <a...@hortonworks.com>
>To: mapreduce-user@hadoop.apache.org 
>Sent: Monday, July 9, 2012 9:24 AM
>Subject: Re: Basic question on how reducer works
> 
>
>Robert,
>
>
>On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>
>>Hi,
>>
>>
>>I have some questions related to basic functionality in Hadoop. 
>>
>>
>>1. When a Mapper processes the intermediate output data, how does it know how 
>>many partitions to create (how many reducers there will be) and how much data 
>>goes into each partition for each reducer?
>>
>>
>>2. When the JobTracker assigns a task to a reducer, it also specifies the 
>>locations of the intermediate output data to retrieve, right? But how does a 
>>reducer know which portion it has to retrieve from each remote location 
>>holding intermediate output?
>
>To add to Harsh's comment: essentially, the TT *knows* where the output of a 
>given map-id/reduce-id pair is present via an output-file/index-file 
>combination.
>
>
>Arun
>
>
>--
>Arun C. Murthy
>Hortonworks Inc.
>http://hortonworks.com/
>
> 
>
>
>

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
