On Sep 1, 2010, at 5:18 AM, Oded Rosen wrote:

I would like to know what happens in the output.collect line that takes lots
of time, in order to cut down this job's running time.
Please keep in mind that I have a combiner, and to my understanding
different things happen to the map output when a combiner is present.

The best presentation on the map-side sort is the one that Chris Douglas (who did most of the implementation) gave at the Bay Area HUG.

http://developer.yahoo.net/blogs/hadoop/2010/01/hadoop_bay_area_january_2010_u.html

There are both slides and a video of the presentation. I'd run through that first.

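You are most likely getting more spills than you deserve. The variables to look at:

io.sort.mb - size of the map-side sort buffer in MB; should be most of the task's RAM budget
io.sort.record.percent - fraction of io.sort.mb reserved for per-record metadata; depends on record size (small records need a larger fraction)
io.sort.factor - number of streams merged at once; typically 25 * (# of disks / node)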
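
As a rough illustration of the rules of thumb above, here is a small Python sketch that turns them into starting values. It is not part of Hadoop. The function name, the 0.7 heap fraction standing in for "most of the task's RAM budget", and the 16 bytes of per-record bookkeeping (which is my understanding of the 0.20-era map output buffer) are all assumptions to check against your own version and workload.

```python
# Hypothetical sizing helper for the three map-side sort settings.
# Assumption: the sort buffer spends about 16 bytes of bookkeeping per record,
# so the metadata share should match 16 / (16 + average record size).
ACCT_BYTES_PER_RECORD = 16


def suggest_sort_settings(task_heap_mb, avg_record_bytes, disks_per_node,
                          heap_fraction=0.7):
    """Return starting values for io.sort.mb, io.sort.record.percent, io.sort.factor.

    heap_fraction is an illustrative stand-in for "most of the task's RAM budget".
    avg_record_bytes is the average serialized key + value size of a map output record.
    """
    io_sort_mb = int(task_heap_mb * heap_fraction)
    # Balance the buffer so record metadata and record data fill up together.
    record_percent = ACCT_BYTES_PER_RECORD / (ACCT_BYTES_PER_RECORD + avg_record_bytes)
    return {
        "io.sort.mb": io_sort_mb,
        "io.sort.record.percent": round(record_percent, 4),
        "io.sort.factor": 25 * disks_per_node,
    }


if __name__ == "__main__":
    # Example: 1 GB task heap, ~100-byte records, 4 disks per node.
    for name, value in suggest_sort_settings(1024, 100, 4).items():
        print(f"{name} = {value}")
```

Treat the result as a first guess only, then compare the spilled-records counter against the map output records counter for the job to see whether the spills actually went down.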

-- Owen
