Hi Ajit,

Take care that increasing *io.sort.mb* will also increase the time of each spill (sort, [+combine], [+compress]). You can check Starfish's cost-based optimizer (CBO)<http://www.cs.duke.edu/starfish/tutorial/optimize.html>. The case of higher intermediate data size is expressed in Starfish's CBO terminology as MAP_RECORDS_SEL >> 1.
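As a concrete sketch of the trade-off above (Hadoop 1.x property names; the jar, class, and paths are hypothetical, and the values are illustrative rather than tuned for any particular job):

```shell
# A larger io.sort.mb gives fewer, bigger spills, so the combiner sees more
# records per spill -- but each spill takes longer, and the task heap
# (mapred.child.java.opts) must be raised so the bigger buffer still fits.
hadoop jar myjob.jar MyJob \
  -D io.sort.mb=500 \
  -D io.sort.spill.percent=0.80 \
  -D min.num.spills.for.combine=3 \
  -D mapred.child.java.opts=-Xmx1024m \
  /input /output
```

(The -D generic options are only honored if the job's driver uses ToolRunner/GenericOptionsParser; otherwise the same properties can be set in the job configuration.)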
You can send me your job along with some input files; I will run it and send you back the tuned configuration parameters recommended by Starfish.

On Tue, Feb 21, 2012 at 8:44 AM, Ajit Ratnaparkhi <ajit.ratnapar...@gmail.com> wrote:
> Thanks Anand.
>
> My combiner is the same as the reducer, and it reduces the data a lot (the
> result data size would be less than 0.1% of the input data size). I tried
> setting these properties (io.sort.mb to 500 MB from 100 MB; Java heap size
> is 1 GB); it improved performance, but not by much.
>
> On Tue, Feb 21, 2012 at 3:26 PM, Anand Srivastava <anand.srivast...@guavus.com> wrote:
>> Hi Ajit,
>>      You could experiment with a higher value of "io.sort.mb" so that
>> the combiner is more effective. However, if your combiner is such that it
>> does not really 'reduce' the number of records, it will not help. You will
>> have to increase the Java heap size as well (mapred.child.java.opts) so
>> that your tasks don't run out of memory.
>>
>> Regards,
>> Anand
>>
>> On 21-Feb-2012, at 3:09 PM, Ajit Ratnaparkhi wrote:
>>
>> > Hi,
>> >
>> > This is about a typical pattern of map-reduce jobs.
>> >
>> > In some map-reduce jobs, the map phase generates more records than its
>> > input; at the reduce phase this data shrinks a lot, and the final output
>> > of reduce is very small.
>> > E.g., for each input record, the map function generates approximately
>> > 100 output records (one output record is roughly the same size as one
>> > input record). A combiner is applied, the map output is shuffled, and it
>> > reaches the reducer, where it is reduced to very small output data (say,
>> > less than 0.1% of the input data size to the map).
>> >
>> > The execution time of such a job (where the map output is larger than
>> > its input) is considerably higher than that of jobs with the same or
>> > fewer map output records for the same input data.
>> >
>> > Has anybody worked on optimizing such jobs? Is there any configuration
>> > tuning that might help here?
>> >
>> > -Ajit

--
Best Regards,
Mostafa Ead