Probably the wrong question in the wrong thread and the wrong mailing list :)
On 7/10/12, Subir S <subir.sasiku...@gmail.com> wrote:
> Is there any property to convey the maximum amount of data each
> reducer/partition may take for processing, like Pig's bytes_per_reducer,
> so that the number of reducers can be controlled based on the size of
> the intermediate map output?
>
> On 7/10/12, Karthik Kambatla <ka...@cloudera.com> wrote:
>> The partitioner is configurable. The default partitioner, from what I
>> remember, computes the partition as the key's hash code modulo the
>> number of reducers/partitions. For random input it is balanced, but
>> some cases can have a very skewed key distribution. Also, as you have
>> pointed out, the number of values per key can vary. Together, the two
>> determine the "weight" of each partition, as you call it.
>>
>> Karthik
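A minimal sketch of the default scheme Karthik describes, modeled on
Hadoop's HashPartitioner as I remember it (the class name is illustrative;
the bit-mask keeps the result non-negative for keys whose hashCode() is
negative):

    import org.apache.hadoop.mapreduce.Partitioner;

    // partition = (hash of key, sign bit cleared) mod number of reducers
    public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

Since every map task runs the same function, a given key always lands in
the same partition no matter which mapper emits it.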
>> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert <rgra...@yahoo.com> wrote:
>>
>>> Thanks Arun.
>>>
>>> So, just for my clarification: the map creates partitions according
>>> to the number of reducers, such that each reducer gets almost the
>>> same number of keys in its partition. However, each key can have a
>>> different number of values, so the "weight" of each partition depends
>>> on that. Also, when a new <key, value> pair is added, is a hash
>>> computed to find the corresponding partition?
>>>
>>> Robert
>>>
>>> ------------------------------
>>> *From:* Arun C Murthy <a...@hortonworks.com>
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 4:33 PM
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>>
>>> Thanks a lot, guys, for the answers.
>>>
>>> Still, I am not able to find exactly the code for the following things:
>>>
>>> 1. Where a reducer reads only its own partition from a map's output.
>>> I looked into ReduceTask#getMapOutput, which does the actual read in
>>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>>> partition (reduce ID) to read.
>>>
>>> Look at TaskTracker.MapOutputServlet.
>>>
>>> 2. I still don't understand in which part of the code (MapTask.java)
>>> the intermediate data is written to each partition. MapOutputBuffer
>>> is the one that actually writes the data to the buffer and spills
>>> after the buffer is full. Could you please elaborate a bit on how the
>>> data is written to its partition?
>>>
>>> Essentially, you can think of the partition-id as the 'primary key'
>>> and the actual 'key' in the map output <key, value> as the 'secondary
>>> key'.
>>>
>>> hth,
>>> Arun
>>>
>>> Thanks,
>>> Robert
>>>
>>> ------------------------------
>>> *From:* Arun C Murthy <a...@hortonworks.com>
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 9:24 AM
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>> Robert,
>>>
>>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>>
>>> Hi,
>>>
>>> I have some questions related to basic functionality in Hadoop.
>>>
>>> 1. When a mapper processes the intermediate output data, how does it
>>> know how many partitions to create (i.e., how many reducers there
>>> will be) and how much data goes into each partition for each reducer?
>>>
>>> 2. When the JobTracker assigns a task to a reducer, it also specifies
>>> the locations of the intermediate output data to retrieve, right? But
>>> how does a reducer know which portion it has to retrieve from each
>>> remote location?
>>>
>>> To add to Harsh's comment: essentially, the TT *knows* where the
>>> output of a given map-id/reduce-id pair is present via an
>>> output-file/index-file combination.
>>>
>>> Arun
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
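A minimal sketch of the output-file/index-file combination Arun mentions,
assuming the Hadoop 1.x layout as I understand it: each map task writes a
single data file (file.out) plus an index file (file.out.index) holding
one fixed-size record per reduce partition. The file and field names
below are illustrative, and the real index file also carries a checksum
that this sketch ignores:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    // One index record per reduce partition: three longs that locate the
    // partition's slice inside the single map output file.
    public class SpillIndexSketch {
        static final int RECORD_SIZE = 3 * 8; // three 8-byte longs

        long startOffset; // byte offset of this partition in file.out
        long rawLength;   // uncompressed length of the partition
        long partLength;  // on-disk (possibly compressed) length

        // Seek to reduceId * RECORD_SIZE and read that reducer's record;
        // the TT can then serve exactly that byte range of file.out.
        static SpillIndexSketch read(String indexFile, int reduceId)
                throws IOException {
            try (DataInputStream in =
                     new DataInputStream(new FileInputStream(indexFile))) {
                in.skipBytes(reduceId * RECORD_SIZE);
                SpillIndexSketch rec = new SpillIndexSketch();
                rec.startOffset = in.readLong();
                rec.rawLength = in.readLong();
                rec.partLength = in.readLong();
                return rec;
            }
        }
    }

This is also the short answer to question 2 above: a reducer only ever
asks each TT for its own reduce id, and the TT uses the index record to
return just that slice of the map's output.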