Hi Ajit,

Take care that increasing *io.sort.mb* will also increase the time of each spill (sort, [+combine], [+compress]). You can check Starfish's cost-based optimizer (CBO)<http://www.cs.duke.edu/starfish/tutorial/optimize.html>. The case of higher intermediate data size is expressed in Starfish's CBO terminology as MAP_RECORDS_SEL >> 1.
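As a concrete sketch of the trade-off above (Hadoop 1.x property names; the jar, class, and paths are hypothetical, and the values are illustrative rather than tuned for any particular job):

```shell
# A larger io.sort.mb gives fewer, bigger spills, so the combiner sees more
# records per spill -- but each spill takes longer, and the task heap
# (mapred.child.java.opts) must be raised so the bigger buffer still fits.
hadoop jar myjob.jar MyJob \
  -D io.sort.mb=500 \
  -D io.sort.spill.percent=0.80 \
  -D min.num.spills.for.combine=3 \
  -D mapred.child.java.opts=-Xmx1024m \
  /input /output
```

(The -D generic options are only honored if the job's driver uses ToolRunner/GenericOptionsParser; otherwise the same properties can be set in the job configuration.)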
You can send me your job along with some input files; I will run it and send you back the tuned configuration parameters recommended by Starfish.

On Tue, Feb 21, 2012 at 8:44 AM, Ajit Ratnaparkhi <ajit.ratnapar...@gmail.com> wrote:
> Thanks Anand.
>
> My combiner is the same as the reducer, and it reduces the data a lot (the
> result data size would be less than 0.1% of the input data size). I tried
> setting these properties (io.sort.mb to 500 MB from 100 MB; Java heap size
> is 1 GB); it improved performance, but not by much.
>
> On Tue, Feb 21, 2012 at 3:26 PM, Anand Srivastava <anand.srivast...@guavus.com> wrote:
>> Hi Ajit,
>>      You could experiment with a higher value of "io.sort.mb" so that
>> the combiner is more effective. However, if your combiner is such that it
>> does not really 'reduce' the number of records, it will not help. You will
>> have to increase the Java heap size as well (mapred.child.java.opts) so
>> that your tasks don't run out of memory.
>>
>> Regards,
>> Anand
>>
>> On 21-Feb-2012, at 3:09 PM, Ajit Ratnaparkhi wrote:
>>
>> > Hi,
>> >
>> > This is about a typical pattern of map-reduce jobs.
>> >
>> > In some map-reduce jobs, the map phase generates more records than its
>> > input; at the reduce phase this data shrinks a lot, and the final output
>> > of reduce is very small.
>> > E.g., for each input record, the map function generates approximately
>> > 100 output records (one output record is roughly the same size as one
>> > input record). A combiner is applied, the map output is shuffled, and it
>> > reaches the reducer, where it is reduced to very small output data (say,
>> > less than 0.1% of the input data size to the map).
>> >
>> > The execution time of such a job (where the map output is larger than
>> > its input) is considerably higher than that of jobs with the same or
>> > fewer map output records for the same input data.
>> >
>> > Has anybody worked on optimizing such jobs? Is there any configuration
>> > tuning that might help here?
>> >
>> > -Ajit

--
Best Regards,
Mostafa Ead