[jira] Created: (HADOOP-363) When combiners exist, postpone mappers' spills of map output to disk until combiners are unsuccessful.

Dick King (JIRA) Fri, 14 Jul 2006 10:17:22 -0700

When combiners exist, postpone mappers' spills of map output to disk until 
combiners are unsuccessful.
------------------------------------------------------------------------------------------------------


                 Key: HADOOP-363
                 URL: http://issues.apache.org/jira/browse/HADOOP-363
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
            Reporter: Dick King


When a map/reduce job is set up with a combiner, each mapper task builds up an 
in-heap collection of 100K key/value pairs. It then applies the combiner to each 
set of pairs with like keys, and spills the combined result to disk so it can 
be sent to the reducers.

Running the combiner typically consumes far fewer resources than shipping the 
data, especially since the data end up in a reducer, where the same code will 
probably be run anyway.

I would like to see this changed so that when the combiner shrinks the 100K 
key/value pairs to fewer than, say, 90K, we just keep running the mapper and 
combiner alternately until the buffer holds enough distinct keys that further 
combining is unlikely to be worthwhile [or until we run out of input, of course].
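
A minimal sketch of the proposed spill policy, in Python for brevity. The names 
run_mapper, combine_buffer, capacity and keep_below are hypothetical and are not 
part of Hadoop; the combiner is assumed to return one value per key.

```python
from collections import defaultdict


def combine_buffer(pairs, combine):
    """Group pairs by key and apply the combiner once per key.

    Returns a key-sorted list holding one (key, value) pair per distinct key.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, combine(key, groups[key])) for key in sorted(groups)]


def run_mapper(records, combine, capacity=100_000, keep_below=90_000):
    """Yield spills, postponing each one while the combiner keeps shrinking.

    records    -- iterable of (key, value) pairs emitted by the map function
    capacity   -- buffer size that triggers a combiner pass (assumed 100K)
    keep_below -- if the combined buffer is smaller than this, keep it in
                  memory and continue mapping instead of spilling (assumed 90K)
    """
    buffer = []
    for pair in records:
        buffer.append(pair)
        if len(buffer) < capacity:
            continue
        buffer = combine_buffer(buffer, combine)
        if len(buffer) >= keep_below:
            # Too many distinct keys: combining is no longer paying off.
            yield buffer
            buffer = []
    if buffer:
        # Out of input: combine whatever is left and spill it.
        yield combine_buffer(buffer, combine)
```

With a summing combiner and a stream dominated by a few keys, this sketch 
produces a single small spill at end of input rather than one spill per 100K 
pairs.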

This has two costs: the whole internal buffer has to be re-sorted so we can 
apply the combiner even though as few as 10K new elements have been added, and 
in some cases we'll call the combiner on many singletons, i.e. keys that have 
only one value.

The first of these costs can be avoided by doing a mini-sort of just the new 
pairs and then merging them with the already-sorted retained elements; the 
merge yields both the combiner sets and the new sorted retained section.
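
The mini-sort plus merge could look roughly like the following Python sketch. 
The function name merge_and_combine is hypothetical, the retained section is 
assumed to be already sorted by key, and the combiner is assumed to return one 
value per key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_and_combine(retained, new_pairs, combine):
    """Combine a key-sorted retained section with unsorted new pairs.

    Only new_pairs is sorted (the mini-sort); the retained section is
    consumed in its existing order by a linear merge.
    """
    by_key = itemgetter(0)
    new_sorted = sorted(new_pairs, key=by_key)                # mini-sort
    merged = heapq.merge(retained, new_sorted, key=by_key)    # linear merge
    # Adjacent pairs with equal keys form one combiner set.
    return [
        (key, combine(key, [value for _, value in group]))
        for key, group in groupby(merged, key=by_key)
    ]
```

The result is again sorted by key, so it can serve as the retained section for 
the next round.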

The second of these costs can be avoided by detecting what would otherwise be 
singleton combiner calls and not making them, which is a good idea in itself 
even if we decide not to make this change.

The two techniques combine well; recycled elements of the buffer need not be 
combined if there's no new element with the same key.
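
A rough Python sketch of the two techniques together. The name 
merge_skipping_singletons is hypothetical. It assumes the retained section is 
sorted with one pair per key, and that the combiner would leave a lone value 
unchanged (true of a sum, for example), so skipping the call is safe.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_skipping_singletons(retained, new_pairs, combine):
    """Merge new pairs into the retained section, skipping singleton calls.

    Returns (combined_pairs, combiner_calls). A key seen only once, whether
    it comes from the retained section or from the new pairs, is passed
    through without invoking the combiner.
    """
    by_key = itemgetter(0)
    merged = heapq.merge(retained, sorted(new_pairs, key=by_key), key=by_key)
    result = []
    calls = 0
    for key, group in groupby(merged, key=by_key):
        values = [value for _, value in group]
        if len(values) == 1:
            # Singleton: nothing to combine, keep the pair as it is.
            result.append((key, values[0]))
        else:
            result.append((key, combine(key, values)))
            calls += 1
    return result, calls
```

Because each retained key appears once, a retained pair reaches the combiner 
only when at least one new pair shares its key.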

-dk



