Is there any property to convey the maximum amount of data each reducer/partition may take for processing, like Pig's bytes_per_reducer, so that the number of reducers can be controlled based on the size of the intermediate map output?
On 7/10/12, Karthik Kambatla <ka...@cloudera.com> wrote:
> The partitioner is configurable. The default partitioner, from what I
> remember, computes the partition as the hashcode modulo the number of
> reducers/partitions. For random input it is balanced, but some cases can
> have a very skewed key distribution. Also, as you have pointed out, the
> number of values per key can also vary. Together, both of them determine
> the "weight" of each partition, as you call it.
>
> Karthik
>
> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert <rgra...@yahoo.com> wrote:
>
>> Thanks Arun.
>>
>> So just for my clarification: the map will create partitions according
>> to the number of reducers, such that each reducer gets almost the same
>> number of keys in its partition. However, each key can have a different
>> number of values, so the "weight" of each partition will depend on that.
>> Also, when a new <key, value> is added into a partition, is a hash
>> computed to find the corresponding partition ID?
>>
>> Robert
>>
>> ------------------------------
>> *From:* Arun C Murthy <a...@hortonworks.com>
>> *To:* mapreduce-user@hadoop.apache.org
>> *Sent:* Monday, July 9, 2012 4:33 PM
>> *Subject:* Re: Basic question on how reducer works
>>
>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>
>> Thanks a lot guys for the answers.
>>
>> Still, I am not able to find exactly the code for the following things:
>>
>> 1. How a reducer reads only its own partition from a map output. I
>> looked into ReduceTask#getMapOutput, which does the actual read in
>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>> partition (reduce ID) to read.
>>
>> Look at TaskTracker.MapOutputServlet.
>>
>> 2. I still don't understand very well in which part of the code
>> (MapTask.java) the intermediate data is written to which partition.
>> MapOutputBuffer is the one that actually writes the data to the buffer
>> and spills after the buffer is full.
>> Could you please elaborate a bit on how the data is written to which
>> partition?
>>
>> Essentially, you can think of the partition-id as the 'primary key' and
>> the actual 'key' in the map output of <key, value> as the 'secondary
>> key'.
>>
>> hth,
>> Arun
>>
>> Thanks,
>> Robert
>>
>> ------------------------------
>> *From:* Arun C Murthy <a...@hortonworks.com>
>> *To:* mapreduce-user@hadoop.apache.org
>> *Sent:* Monday, July 9, 2012 9:24 AM
>> *Subject:* Re: Basic question on how reducer works
>>
>> Robert,
>>
>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>
>> Hi,
>>
>> I have some questions related to basic functionality in Hadoop.
>>
>> 1. When a Mapper processes the intermediate output data, how does it
>> know how many partitions to create (i.e., how many reducers there will
>> be) and how much data goes into each partition for each reducer?
>>
>> 2. When the JobTracker assigns a task to a reducer, it will also specify
>> the locations of the intermediate output data it should retrieve, right?
>> But how does a reducer know which portion it has to retrieve from each
>> remote location holding intermediate output?
>>
>> To add to Harsh's comment: essentially, the TT *knows* where the output
>> of a given map-id/reduce-id pair is present via an output-file/index-file
>> combination.
>>
>> Arun
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
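Karthik's description of the default partitioner corresponds to Hadoop's HashPartitioner, whose core is essentially a one-liner. The standalone sketch below (no Hadoop dependencies; the class and key names are illustrative) shows the hashcode-modulo logic and why a single hot key can make one partition far heavier than the rest:

```java
public class HashPartitionSketch {

    // Sketch of the default hash partitioning: hashCode modulo the number
    // of reducers. Masking with Integer.MAX_VALUE keeps the result
    // non-negative even when hashCode() returns a negative value.
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        String[] keys = {"apple", "banana", "cherry", "apple"};
        for (String k : keys) {
            System.out.println(k + " -> partition "
                    + getPartition(k, numReducers));
        }
        // Every occurrence of the same key lands in the same partition,
        // which is what reducers rely on. But nothing here balances the
        // number of *values* per key: a skewed key distribution, as
        // Karthik notes, skews the "weight" of the partitions.
    }
}
```

For skewed data, the usual remedy is a custom Partitioner (the interface is pluggable, as Karthik says), since the hash itself has no knowledge of per-key value counts.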
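Arun's "partition-id as primary key, map key as secondary key" point can be sketched in plain Java. The `Rec` type and in-memory buffer below are illustrative stand-ins, not Hadoop's actual MapOutputBuffer internals; the sketch only shows the sort order applied at spill time:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SpillSortSketch {
    // Illustrative record type, not a real Hadoop class: a buffered
    // map-output entry tagged with its partition id.
    static class Rec {
        final int partition;   // assigned by the partitioner
        final String key;
        final String value;
        Rec(int p, String k, String v) { partition = p; key = k; value = v; }
    }

    public static void main(String[] args) {
        List<Rec> buffer = new ArrayList<>();
        buffer.add(new Rec(1, "b", "v1"));
        buffer.add(new Rec(0, "c", "v2"));
        buffer.add(new Rec(1, "a", "v3"));
        buffer.add(new Rec(0, "a", "v4"));

        // Primary sort on partition id, secondary sort on key: after this,
        // each reducer's partition is one contiguous, key-sorted run, so an
        // index file only needs an (offset, length) entry per reducer and a
        // reducer can fetch exactly its own slice.
        buffer.sort(Comparator.<Rec>comparingInt(r -> r.partition)
                              .thenComparing(r -> r.key));

        for (Rec r : buffer) {
            System.out.println(r.partition + " " + r.key + " " + r.value);
        }
    }
}
```

This is the sense in which the partition-id acts as the 'primary key': it groups the records per reducer first, and only within a group does the map-output key order matter.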