Each reduce task goes through these phases:

1. Shuffle: copy intermediate results from all mappers into an in-memory buffer. Once the buffer fills up, the data copied from the mappers is merged, combined and *spilled to disk*.
2. Merge: merge the on-disk spills.
3. Reduce: call the reduce method on each key.
4. Write the output to HDFS.
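For illustration only, a minimal word-count style reducer (class and type choices are mine, not from this thread) that shows where phases 3 and 4 happen: the framework calls reduce() once per key after shuffle/merge, and context.write() emits the records that end up in HDFS.

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
              sum += v.get();                          // phase 3: the reduce call itself
          }
          context.write(key, new IntWritable(sum));    // phase 4: output written to HDFS
      }
  }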
So, to reduce IO, you can:

- increase the size of the memory buffer used in the shuffle phase by setting the configuration parameter *mapred.job.shuffle.input.buffer.percent* to a higher value. That reduces the number of disk spills and hence the IO. (A small configuration sketch is at the bottom of this message, below the quoted thread.)

On Tue, Feb 7, 2012 at 2:52 AM, Marek Miglinski <mmiglin...@seven.com> wrote:

> Thanks for the reply,
>
> As it turns out that didn't help; IO is used even more, since each reducer is
> copying and sorting. What are the options? Is there an option to limit
> reduce -> copy and reduce -> sort somehow?
>
> Thanks,
> Marek M.
>
> ________________________________
> From: Mostafa Gaber [moustafa.ga...@gmail.com]
> Sent: Monday, February 06, 2012 6:50 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Reducer IO
>
> Hello Marek,
>
> I think you can increase the number of reducers for your MR job so as to
> reduce the amount of intermediate key-value pairs assigned to each reducer.
> Note also that the number of reducers depends on your job and how the
> output should be produced.
>
> On Mon, Feb 6, 2012 at 11:37 AM, Marek Miglinski <mmiglin...@seven.com
> <mailto:mmiglin...@seven.com>> wrote:
> Hey,
>
> I have a mapreduce job (a transactions loader) and its main problem is the
> "reduce->copy" and "reduce->sort" phase, which takes all the IO and uses all
> the disk resources. What are the possible ways to reduce this load? My cloud
> settings are:
>
> ioSortFactor=80
> ioSortMb=800
> (mapredChildJavaOpts=Xmx1152m)
>
> I can lower those settings; what else can I tweak?
>
> Thanks,
> Marek M.
>
> --
> Best Regards,
> Mostafa Ead

--
Best Regards,
Mostafa Ead
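P.S. A sketch of how the parameters discussed above could be set per job through the old JobConf API. The 0.90 value is only an example (the Hadoop 1.x default is 0.70), and the class name is a placeholder; the io.sort.* values are the ones Marek quoted in his original question.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapred.JobConf;

  public class ShuffleBufferExample {
      public static JobConf buildConf() {
          JobConf conf = new JobConf(new Configuration());
          // Fraction of reduce-task heap used to hold map output before spilling.
          conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.90f);
          // Settings from the original question: io.sort.mb is the map-side
          // sort buffer size (MB), io.sort.factor the number of streams
          // merged at once during the merge phases.
          conf.setInt("io.sort.factor", 80);
          conf.setInt("io.sort.mb", 800);
          return conf;
      }
  }

The same properties can of course also go into mapred-site.xml if you want them cluster-wide rather than per job.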