Thanks, Mostafa, for referencing Starfish! Let me give a quick
introduction to what Starfish can do.

Starfish is a self-tuning system built on Hadoop that provides good
performance automatically, without requiring users to understand and
manipulate Hadoop's many tuning knobs.

With Starfish, you can analyze the performance of your Hadoop job at a
fine-grained level, e.g., the time spent in map processing, spilling,
merging, shuffling, sorting, and reduce processing, so you can see which
phase is the performance bottleneck.

You can also ask "what-if" questions, e.g., "What if I double io.sort.mb?",
and Starfish will predict the new behaviour of the job, so you can better
understand how these parameters work.  In addition, you can simply delegate
to Starfish the task of finding the optimal configuration that achieves the
best performance.
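
To give a concrete (hypothetical) illustration of the kind of knobs
Starfish tunes, here is a minimal sketch using the plain Hadoop 1.x Java
API, not Starfish itself; the class name and parameter values below are
placeholders, not recommendations:

    // Minimal sketch (plain Hadoop 1.x API): applying values for the
    // map-side sort buffer and per-task JVM heap discussed in this thread.
    // The values below are placeholders, not Starfish recommendations.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TunedJobExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 200);                  // map-side sort buffer (MB)
        conf.set("mapred.child.java.opts", "-Xmx1024m"); // per-task JVM heap
        Job job = new Job(conf, "tuned-job");
        // ... set mapper, combiner, reducer, and input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }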

You are welcome to join our Google Group to discuss Starfish further; any
feedback will be appreciated. If you run into any problems, please don't
hesitate to let us know. The group address is
http://groups.google.com/group/hadoop-starfish.

Thanks,
Jie
------------------------
Starfish Group, Duke University
Starfish Homepage: www.cs.duke.edu/starfish/
Starfish Google Group: http://groups.google.com/group/hadoop-starfish

On Tue, Feb 21, 2012 at 6:53 PM, Mostafa Gaber <moustafa.ga...@gmail.com> wrote:

> Hi Ajit,
>
> Take care that increasing *io.sort.mb* will also increase the time of
> each spill (Sort, [+Combine], [+Compress]). You can check Starfish's
> cost-based optimizer (CBO)
> <http://www.cs.duke.edu/starfish/tutorial/optimize.html>. The case of
> higher intermediate data size is expressed in Starfish's CBO terminology
> as MAP_RECORDS_SEL >> 1.
>
> You can send me your job along with some input files; I will run it and
> send you back the tuned configuration parameters recommended by
> Starfish.
>
>
> On Tue, Feb 21, 2012 at 8:44 AM, Ajit Ratnaparkhi <
> ajit.ratnapar...@gmail.com> wrote:
>
>> Thanks Anand.
>>
>> My combiner is the same as my reducer, and it reduces the data a lot
>> (the result data size is less than 0.1% of the input data size). I tried
>> setting these properties (io.sort.mb from 100 MB to 500 MB, Java heap
>> size of 1 GB); it improved performance, but not by much.
>>
>>
>> On Tue, Feb 21, 2012 at 3:26 PM, Anand Srivastava <
>> anand.srivast...@guavus.com> wrote:
>>
>>> Hi Ajit,
>>>        You could experiment with a higher value of "io.sort.mb" so that
>>> the combiner is more effective. However, if your combiner does not
>>> really 'reduce' the number of records, it will not help. You will also
>>> have to increase the Java heap size (mapred.child.java.opts) so that
>>> your tasks don't run out of memory.
>>>
>>> Regards,
>>> Anand
>>>
>>> On 21-Feb-2012, at 3:09 PM, Ajit Ratnaparkhi wrote:
>>>
>>> > Hi,
>>> >
>>> > This is about a typical pattern of map-reduce jobs.
>>> >
>>> > There are some map-reduce jobs in which the map phase generates more
>>> records than its input; in the reduce phase the data shrinks a lot, and
>>> the final output of the reduce is very small.
>>> > E.g., for each input record, the map generates approximately 100
>>> output records (each output record is approximately the same size as an
>>> input record). A combiner is applied, the map output is shuffled, and it
>>> reaches the reducer, where it is reduced to very small output data (say,
>>> less than 0.1% of the map's input data size).
>>> >
>>> > The execution time of such a job (where the map output is larger than
>>> its input) is considerably higher than that of jobs that produce the same
>>> number of (or fewer) map output records for the same input data.
>>> >
>>> > Has anybody worked on optimizing such jobs? Is there any
>>> configuration tuning that might help here?
>>> >
>>> > -Ajit
>>> >
>>> >
>>>
>>>
>>
>
>
> --
> Best Regards,
> Mostafa Ead
>
>
