Hi Ajit,

You could experiment with a higher value of "io.sort.mb" so that the combiner
is more effective. However, if your combiner does not actually 'reduce' the
number of records, it will not help. You will also have to increase the Java
heap size (mapred.child.java.opts) so that your tasks don't run out of memory.
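
A minimal sketch of how those two settings might be applied in a job driver,
assuming the old mapred.* parameter names in use at the time (Hadoop 0.20/1.x
era) and the new-API Job class; the values are only illustrative, not
recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SortBufferTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Larger in-memory sort buffer, so more map output is combined
            // before being spilled to disk (default is 100 MB).
            conf.set("io.sort.mb", "256");

            // Give each task JVM enough heap to hold the larger sort buffer
            // plus the task's own working memory.
            conf.set("mapred.child.java.opts", "-Xmx1024m");

            Job job = Job.getInstance(conf, "wide-map-output-job");
            // ... set mapper, combiner, reducer, input/output paths as usual ...
        }
    }

The same properties can of course go into mapred-site.xml or be passed on the
command line with -D if you prefer not to hard-code them.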
Regards,
Anand

On 21-Feb-2012, at 3:09 PM, Ajit Ratnaparkhi wrote:

> Hi,
>
> This is about a typical pattern of map-reduce jobs.
>
> There are some map-reduce jobs in which the map phase generates more records
> than it receives as input; at the reduce phase this data shrinks a lot, and
> the final output of the reduce is very small.
> E.g. for each input record, map generates approx. 100 output records (one
> output record is approx. the same size as one input record). A combiner is
> applied, the map output is shuffled and reaches the reducer, where it is
> reduced to a very small output (say less than 0.1% of the input data size).
>
> The execution time of this kind of job (where the map output is larger than
> its input) is considerably higher than that of jobs with the same or fewer
> map output records for the same input data.
>
> Has anybody worked on optimizing such jobs? Any configuration tuning which
> might help here?
>
> -Ajit