Is there any property to convey the maximum amount of data each reducer/partition may take for processing, like Pig's bytes_per_reducer, so that the number of reducers can be controlled based on the size of the intermediate map output?
On 7/10/12, Karthik Kambatla <ka...@cloudera.com> wrote:
> The partitioner is configurable. The default partitioner, from what I
> remember, computes the partition as the hashcode modulo the number of
> reducers/partitions. For random input it is balanced, but some cases can
> have a very skewed key distribution. Also, as you have pointed out, the
> number of values per key can also vary. Together, both of them determine
> the "weight" of each partition, as you call it.
>
> Karthik
>
> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert <rgra...@yahoo.com> wrote:
>
>> Thanks Arun.
>>
>> So just for my clarification: the map will create partitions according
>> to the number of reducers, such that each reducer gets almost the same
>> number of keys in its partition. However, each key can have a different
>> number of values, so the "weight" of each partition will depend on that.
>> Also, when a new <key, value> is added into a partition, is a hash
>> computed to find the corresponding partition ID?
>>
>> Robert
>>
>> ------------------------------
>> *From:* Arun C Murthy <a...@hortonworks.com>
>> *To:* mapreduce-user@hadoop.apache.org
>> *Sent:* Monday, July 9, 2012 4:33 PM
>> *Subject:* Re: Basic question on how reducer works
>>
>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>
>> Thanks a lot guys for the answers.
>>
>> Still, I am not able to find exactly the code for the following things:
>>
>> 1. How a reducer reads only its own partition from a map output. I
>> looked into ReduceTask#getMapOutput, which does the actual read in
>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>> partition (reduce ID) to read.
>>
>> Look at TaskTracker.MapOutputServlet.
>>
>> 2. I still don't understand very well in which part of the code
>> (MapTask.java) the intermediate data is written to which partition.
>> MapOutputBuffer is the one that actually writes the data to the buffer
>> and spills after the buffer is full.
>> Could you please elaborate a bit on how the data is written to which
>> partition?
>>
>> Essentially, you can think of the partition-id as the 'primary key' and
>> the actual 'key' in the map output of <key, value> as the 'secondary
>> key'.
>>
>> hth,
>> Arun
>>
>> Thanks,
>> Robert
>>
>> ------------------------------
>> *From:* Arun C Murthy <a...@hortonworks.com>
>> *To:* mapreduce-user@hadoop.apache.org
>> *Sent:* Monday, July 9, 2012 9:24 AM
>> *Subject:* Re: Basic question on how reducer works
>>
>> Robert,
>>
>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>
>> Hi,
>>
>> I have some questions related to basic functionality in Hadoop.
>>
>> 1. When a Mapper processes the intermediate output data, how does it
>> know how many partitions to create (i.e., how many reducers there will
>> be) and how much data goes into each partition for each reducer?
>>
>> 2. When the JobTracker assigns a task to a reducer, it will also specify
>> the locations of the intermediate output data it should retrieve, right?
>> But how does a reducer know which portion it has to retrieve from each
>> remote location holding intermediate output?
>>
>> To add to Harsh's comment: essentially, the TT *knows* where the output
>> of a given map-id/reduce-id pair is present via an output-file/index-file
>> combination.
>>
>> Arun
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
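Karthik's description of the default partitioner corresponds to Hadoop's HashPartitioner, whose core is essentially a one-liner. The standalone sketch below (no Hadoop dependencies; the class and key names are illustrative) shows the hashcode-modulo logic and why a single hot key can make one partition far heavier than the rest:

```java
public class HashPartitionSketch {

    // Sketch of the default hash partitioning: hashCode modulo the number
    // of reducers. Masking with Integer.MAX_VALUE keeps the result
    // non-negative even when hashCode() returns a negative value.
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        String[] keys = {"apple", "banana", "cherry", "apple"};
        for (String k : keys) {
            System.out.println(k + " -> partition "
                    + getPartition(k, numReducers));
        }
        // Every occurrence of the same key lands in the same partition,
        // which is what reducers rely on. But nothing here balances the
        // number of *values* per key: a skewed key distribution, as
        // Karthik notes, skews the "weight" of the partitions.
    }
}
```

For skewed data, the usual remedy is a custom Partitioner (the interface is pluggable, as Karthik says), since the hash itself has no knowledge of per-key value counts.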
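Arun's "partition-id as primary key, map key as secondary key" point can be sketched in plain Java. The `Rec` type and in-memory buffer below are illustrative stand-ins, not Hadoop's actual MapOutputBuffer internals; the sketch only shows the sort order applied at spill time:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SpillSortSketch {
    // Illustrative record type, not a real Hadoop class: a buffered
    // map-output entry tagged with its partition id.
    static class Rec {
        final int partition;   // assigned by the partitioner
        final String key;
        final String value;
        Rec(int p, String k, String v) { partition = p; key = k; value = v; }
    }

    public static void main(String[] args) {
        List<Rec> buffer = new ArrayList<>();
        buffer.add(new Rec(1, "b", "v1"));
        buffer.add(new Rec(0, "c", "v2"));
        buffer.add(new Rec(1, "a", "v3"));
        buffer.add(new Rec(0, "a", "v4"));

        // Primary sort on partition id, secondary sort on key: after this,
        // each reducer's partition is one contiguous, key-sorted run, so an
        // index file only needs an (offset, length) entry per reducer and a
        // reducer can fetch exactly its own slice.
        buffer.sort(Comparator.<Rec>comparingInt(r -> r.partition)
                              .thenComparing(r -> r.key));

        for (Rec r : buffer) {
            System.out.println(r.partition + " " + r.key + " " + r.value);
        }
    }
}
```

This is the sense in which the partition-id acts as the 'primary key': it groups the records per reducer first, and only within a group does the map-output key order matter.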