Re: Job performance issue: output.collect()

Owen O'Malley Wed, 01 Sep 2010 09:10:33 -0700


On Sep 1, 2010, at 5:18 AM, Oded Rosen wrote:

I would like to know what happens in the output.collect line that takes lots
of time, in order to cut down this job's running time.
Please keep in mind that I have a combiner, and to my understanding
different things happen to the map output when a combiner is present.

The best presentation on the map side sort is the one that Chris Douglas (who did most of the implementation) did for the Bay Area HUG.


http://developer.yahoo.net/blogs/hadoop/2010/01/hadoop_bay_area_january_2010_u.html

There are both slides and a video of the presentation. I'd run through that first.

You most likely are getting more spills than you deserve. The variables to look at:


io.sort.mb - should be most of the task's ram budget
io.sort.record.percent - depends on record size
io.sort.factor - typically 25 * (# of disks / node)
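
As a rough illustration of the rules of thumb above, here is a small Java sketch that derives starting values from a node's disk count and an average record size. The class and method names are hypothetical, and the 16 bytes of per-record accounting is an assumption about the 0.20-era map-side sort buffer, so treat the numbers as a starting point and check them against the spill counters for your own job.

```java
import java.util.Locale;

// Hypothetical helper, not part of Hadoop: turns the rules of thumb
// into concrete starting values for the io.sort.* settings.
class SortTuning {
    // Assumption: the 0.20-era map-side sort buffer keeps about 16 bytes
    // of accounting data for every record it collects.
    static final int ACCOUNTING_BYTES_PER_RECORD = 16;

    // io.sort.factor: typically 25 * (# of disks / node).
    static int sortFactor(int disksPerNode) {
        return 25 * disksPerNode;
    }

    // io.sort.record.percent: the share of the sort buffer reserved for
    // accounting data. Sizing it as accounting / (accounting + record bytes)
    // lets both parts of the buffer fill up at roughly the same time.
    static double recordPercent(double avgRecordBytes) {
        return ACCOUNTING_BYTES_PER_RECORD
                / (ACCOUNTING_BYTES_PER_RECORD + avgRecordBytes);
    }

    public static void main(String[] args) {
        int disksPerNode = 4;           // example value, adjust for your nodes
        double avgRecordBytes = 100.0;  // example serialized key + value size

        System.out.println("io.sort.factor = " + sortFactor(disksPerNode));
        System.out.println(String.format(Locale.ROOT,
                "io.sort.record.percent = %.2f", recordPercent(avgRecordBytes)));
    }
}
```

Smaller records push io.sort.record.percent up, since more of the buffer goes to accounting; larger records push it down.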

-- Owen
