Thanks Anand. My combiner is the same as my reducer, and it reduces the data a lot (the final output is less than 0.1% of the input data size). I tried setting those properties (io.sort.mb raised from 100 MB to 500 MB, with a 1 GB task heap); it improved performance, but not by much.
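For reference, this is roughly how I am setting those properties in the job driver. It is only a minimal sketch: ExpandMapper and SumReducer are placeholder names for my mapper and for the reducer I also use as a combiner, and the key/value types are just illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapExpandJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Larger map-side sort buffer, so more records get combined per spill.
    conf.setInt("io.sort.mb", 500);
    // Bigger task heap, since the sort buffer lives inside the task JVM.
    conf.set("mapred.child.java.opts", "-Xmx1024m");

    Job job = new Job(conf, "expand-then-aggregate");
    job.setJarByClass(MapExpandJob.class);
    job.setMapperClass(ExpandMapper.class);      // placeholder mapper class
    // The reducer doubles as the combiner, so map output shrinks before the shuffle.
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}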
On Tue, Feb 21, 2012 at 3:26 PM, Anand Srivastava <anand.srivast...@guavus.com> wrote:
> Hi Ajit,
> You could experiment with a higher value of "io.sort.mb" so that
> the combiner is more effective. However, if your combiner is such that it
> does not really 'reduce' the number of records, it will not help. You will
> have to increase the Java heap size as well (mapred.child.java.opts) so
> that your tasks don't go out of memory.
>
> Regards,
> Anand
>
> On 21-Feb-2012, at 3:09 PM, Ajit Ratnaparkhi wrote:
>
> > Hi,
> >
> > This is about a typical pattern of map-reduce jobs.
> >
> > In some map-reduce jobs the map phase generates many more records
> > than it receives as input; at the reduce phase this data shrinks
> > a lot, and the final output of the reduce is very small.
> > E.g. each map function call, i.e. each input record, generates
> > approximately 100 output records (each output record is roughly the
> > same size as an input record). A combiner is applied, the map output
> > is shuffled, and it reaches the reducer, where it is reduced to a very
> > small output (say less than 0.1% of the map input data size).
> >
> > The execution time of this kind of job (where the map output is larger
> > than its input) is considerably higher than that of jobs producing the
> > same number of, or fewer, map output records for the same input data.
> >
> > Has anybody worked on optimizing such jobs? Is there any configuration
> > tuning that might help here?
> >
> > -Ajit
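For completeness, this is the kind of reducer-that-also-serves-as-a-combiner I mean. It is a minimal sketch under the assumption that keys are Text and values are LongWritable counts (names and types are illustrative, not my actual job): many map output records per key collapse to one, which is what makes the combiner effective on the map side.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all values for a key, so N map output records per key become one record.
// Because the operation is associative and commutative, the same class can be
// registered as both the combiner and the reducer.
public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  private final LongWritable result = new LongWritable();

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}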