Job performance issue: output.collect()

Oded Rosen Wed, 01 Sep 2010 07:30:10 -0700

Hi all,

My job (written against the old 0.18 API, but that's not the issue here) is
producing large amounts of map output.
Each map() call makes about 20 output.collect() calls, and each output
record is fairly large (~1K), so each map() call produces about 20K.
All of this data is fed to a combiner that substantially reduces both the
size and the number of the output records.
The job input is not that big: there are about 120M map input records.
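
For scale, here is a rough back-of-the-envelope sketch of the numbers above.
It is only an estimate: the class and method names are made up for
illustration, and it assumes exactly 20 collects per map() call and 1K
(taken as 1024 bytes) per output record.

```java
// Rough estimate of the map output volume described above.
// All names here are illustrative; nothing below is Hadoop API.
class MapOutputEstimate {

    // Total number of output.collect() calls across the whole job.
    static long totalCollectCalls(long mapInputRecords, int collectsPerMap) {
        return mapInputRecords * collectsPerMap;
    }

    // Total bytes handed to output.collect() before the combiner runs.
    static long totalOutputBytes(long collectCalls, int bytesPerOutput) {
        return collectCalls * bytesPerOutput;
    }

    public static void main(String[] args) {
        long calls = totalCollectCalls(120_000_000L, 20); // assumed: 20 per map()
        long bytes = totalOutputBytes(calls, 1024);       // assumed: 1024 bytes each
        System.out.println("collect() calls: " + calls);
        System.out.println("map output bytes before combining: " + bytes);
    }
}
```

Under those assumptions that is about 2.4 billion collect() calls and
roughly 2.4 TB of map output before the combiner runs.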


This job is pretty slow. Other jobs that work on the same input are much
faster, since they do not produce so much output.
Analyzing the job performance (timing the map() function parts), I've seen
that much time is spent on the output.collect() line itself.
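
The kind of timing I mean can be sketched roughly as below. This is a
simplified, standalone illustration, not my actual job code: Collector is a
hypothetical stand-in for Hadoop's OutputCollector so that the sketch
compiles without Hadoop, and TimingCollector is a made-up wrapper name.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's OutputCollector interface.
interface Collector<K, V> {
    void collect(K key, V value) throws IOException;
}

// Wraps another collector and accumulates wall-clock time spent in collect().
class TimingCollector<K, V> implements Collector<K, V> {
    private final Collector<K, V> delegate;
    private long nanosInCollect = 0L;
    private long calls = 0L;

    TimingCollector(Collector<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void collect(K key, V value) throws IOException {
        long start = System.nanoTime();
        try {
            delegate.collect(key, value);
        } finally {
            // Count the call and its elapsed time even if collect() throws.
            nanosInCollect += System.nanoTime() - start;
            calls++;
        }
    }

    long getCalls() { return calls; }
    long getNanosInCollect() { return nanosInCollect; }
}

class TimingCollectorDemo {
    public static void main(String[] args) throws IOException {
        List<String> sink = new ArrayList<String>();
        TimingCollector<String, String> timed =
            new TimingCollector<String, String>((k, v) -> sink.add(k + "=" + v));
        for (int i = 0; i < 20; i++) {   // about 20 collects per map() call
            timed.collect("key" + i, "value" + i);
        }
        System.out.println(timed.getCalls() + " calls, "
            + timed.getNanosInCollect() + " ns inside collect()");
    }
}
```

The measured time covers whatever the underlying collector does inside the
call, and calling System.nanoTime() on every record adds some overhead of
its own, so the totals are only indicative.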

I know that during the output.collect() call the output is written to
spill files on the local filesystem (when the spill buffer reaches an 80%
limit), so I guessed that reducing the size of each output would improve
performance. This was not the case: after cutting the map output size by
30%, the job took the same amount of time. The thing that I cannot reduce
is the number of output records written by each map() call.
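
To make the spill limit concrete, here is a rough sketch of how many records
might fit in the buffer before a spill starts. Everything in it is an
assumption rather than something measured on my cluster: a 100 MB sort
buffer (io.sort.mb), an 80% spill threshold (io.sort.spill.percent), 5% of
the buffer reserved for per-record accounting (io.sort.record.percent), and
16 bytes of accounting per record. The class and method names are
illustrative only.

```java
// Estimates how many records fit in the map-side sort buffer before a spill.
// The settings and the 16-byte accounting size are assumptions, see above.
class SpillEstimate {
    static final long ACCOUNTING_BYTES_PER_RECORD = 16L; // assumed

    // Returns the number of records collected before either soft limit
    // (record count or serialized bytes) is reached.
    static long recordsPerSpill(long sortMb, int spillPercent,
                                int recordPercent, long bytesPerRecord) {
        long totalBytes = sortMb * 1024L * 1024L;

        // Part of the buffer holds per-record accounting entries.
        long accountingBytes = totalBytes * recordPercent / 100;
        accountingBytes -= accountingBytes % ACCOUNTING_BYTES_PER_RECORD;
        long recordLimit = (accountingBytes / ACCOUNTING_BYTES_PER_RECORD)
                * spillPercent / 100;

        // The rest holds the serialized keys and values.
        long dataBytes = totalBytes - accountingBytes;
        long byteLimit = dataBytes * spillPercent / 100;
        long recordsByBytes = byteLimit / bytesPerRecord;

        // Whichever limit is hit first triggers the spill.
        return Math.min(recordLimit, recordsByBytes);
    }

    public static void main(String[] args) {
        System.out.println("1024-byte records per spill: "
            + recordsPerSpill(100, 80, 5, 1024));
        System.out.println("717-byte records per spill (30% smaller): "
            + recordsPerSpill(100, 80, 5, 717));
    }
}
```

Under these assumed settings the byte limit, not the record limit, is what
triggers a spill for records of about 1K, so smaller records would mean
fewer spills while the number of records passing through collect() stays
the same.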

I would like to know what happens inside the output.collect() call that
takes so much time, in order to cut down this job's running time.
Please keep in mind that I have a combiner, and to my understanding the
map output is handled differently when a combiner is present.

Can anyone help me understand how I can save this precious time?
Thanks,

-- 
Oded
