Probably the wrong question in the wrong thread and the wrong mailing list :)
On 7/10/12, Subir S <subir.sasiku...@gmail.com> wrote:
> Is there any property to convey the maximum amount of data each
> reducer/partition may take for processing, like Pig's bytes_per_reducer,
> so that the number of reducers can be controlled based on the size of
> the intermediate map output?
>
> On 7/10/12, Karthik Kambatla <ka...@cloudera.com> wrote:
>> The partitioner is configurable. The default partitioner, from what I
>> remember, computes the partition as the key's hash code modulo the
>> number of reducers/partitions. For random input it is balanced, but
>> some cases can have a very skewed key distribution. Also, as you have
>> pointed out, the number of values per key can vary. Together, the two
>> determine the "weight" of each partition, as you call it.
>>
>> Karthik
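A minimal sketch of the default scheme Karthik describes, modeled on
Hadoop's HashPartitioner as I remember it (the class name is illustrative;
the bit-mask keeps the result non-negative for keys whose hashCode() is
negative):

    import org.apache.hadoop.mapreduce.Partitioner;

    // partition = (hash of key, sign bit cleared) mod number of reducers
    public class HashPartitionerSketch<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

Since every map task runs the same function, a given key always lands in
the same partition no matter which mapper emits it.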
>> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert <rgra...@yahoo.com> wrote:
>>
>>> Thanks Arun.
>>>
>>> So, just for my clarification: the map creates partitions according
>>> to the number of reducers, such that each reducer gets almost the
>>> same number of keys in its partition. However, each key can have a
>>> different number of values, so the "weight" of each partition depends
>>> on that. Also, when a new <key, value> pair is added, is a hash
>>> computed to find the corresponding partition?
>>>
>>> Robert
>>>
>>> ------------------------------
>>> *From:* Arun C Murthy <a...@hortonworks.com>
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 4:33 PM
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>>
>>> Thanks a lot, guys, for the answers.
>>>
>>> Still, I am not able to find exactly the code for the following things:
>>>
>>> 1. Where a reducer reads only its own partition from a map's output.
>>> I looked into ReduceTask#getMapOutput, which does the actual read in
>>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>>> partition (reduce ID) to read.
>>>
>>> Look at TaskTracker.MapOutputServlet.
>>>
>>> 2. I still don't understand in which part of the code (MapTask.java)
>>> the intermediate data is written to each partition. MapOutputBuffer
>>> is the one that actually writes the data to the buffer and spills
>>> after the buffer is full. Could you please elaborate a bit on how the
>>> data is written to its partition?
>>>
>>> Essentially, you can think of the partition-id as the 'primary key'
>>> and the actual 'key' in the map output <key, value> as the 'secondary
>>> key'.
>>>
>>> hth,
>>> Arun
>>>
>>> Thanks,
>>> Robert
>>>
>>> ------------------------------
>>> *From:* Arun C Murthy <a...@hortonworks.com>
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 9:24 AM
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>> Robert,
>>>
>>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>>
>>> Hi,
>>>
>>> I have some questions related to basic functionality in Hadoop.
>>>
>>> 1. When a mapper processes the intermediate output data, how does it
>>> know how many partitions to create (i.e., how many reducers there
>>> will be) and how much data goes into each partition for each reducer?
>>>
>>> 2. When the JobTracker assigns a task to a reducer, it also specifies
>>> the locations of the intermediate output data to retrieve, right? But
>>> how does a reducer know which portion it has to retrieve from each
>>> remote location?
>>>
>>> To add to Harsh's comment: essentially, the TT *knows* where the
>>> output of a given map-id/reduce-id pair is present via an
>>> output-file/index-file combination.
>>>
>>> Arun
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
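A minimal sketch of the output-file/index-file combination Arun mentions,
assuming the Hadoop 1.x layout as I understand it: each map task writes a
single data file (file.out) plus an index file (file.out.index) holding
one fixed-size record per reduce partition. The file and field names
below are illustrative, and the real index file also carries a checksum
that this sketch ignores:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    // One index record per reduce partition: three longs that locate the
    // partition's slice inside the single map output file.
    public class SpillIndexSketch {
        static final int RECORD_SIZE = 3 * 8; // three 8-byte longs

        long startOffset; // byte offset of this partition in file.out
        long rawLength;   // uncompressed length of the partition
        long partLength;  // on-disk (possibly compressed) length

        // Seek to reduceId * RECORD_SIZE and read that reducer's record;
        // the TT can then serve exactly that byte range of file.out.
        static SpillIndexSketch read(String indexFile, int reduceId)
                throws IOException {
            try (DataInputStream in =
                     new DataInputStream(new FileInputStream(indexFile))) {
                in.skipBytes(reduceId * RECORD_SIZE);
                SpillIndexSketch rec = new SpillIndexSketch();
                rec.startOffset = in.readLong();
                rec.rawLength = in.readLong();
                rec.partLength = in.readLong();
                return rec;
            }
        }
    }

This is also the short answer to question 2 above: a reducer only ever
asks each TT for its own reduce id, and the TT uses the index record to
return just that slice of the map's output.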