(Moving to mapreduce-user@, bcc hdfs-user@. Please use appropriate project 
lists - thanks)

On Jul 10, 2011, at 4:42 AM, Florin P wrote:

> Hello!
>  I've read on 
> http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html 
> (cite):
> "The execution of combiner is not guaranteed, Hadoop may or may not execute a 
> combiner. Also, if required it may execute it more then 1 times. Therefore 
> your MapReduce jobs should not depend on the combiners execution. "
> Is it true? 

Right. The way to visualize it: the MR framework in the map task collects the 
'raw' (i.e. serialized) map-output key-values in the 'sort' buffer. When the 
buffer fills up it runs the combiner (if one is set) and then spills the 
output to disk; this happens for every spill, including the last (final) one. 
The combiner is also run when multiple spills have to be merged from disk. 
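
As a side note (my gloss, not anything from the framework docs): in the 
0.20/1.x configuration the sort buffer and the spill/combine behaviour are 
controlled by a handful of job properties. A rough sketch, assuming the 
property names and defaults of that era (they were renamed in later releases):

  // Sketch only - property names/defaults assumed from the 0.20/1.x era.
  Configuration conf = new Configuration();
  // Size (MB) of the in-memory 'sort' buffer that collects the serialized
  // map-output key-values.
  conf.setInt("io.sort.mb", 100);
  // Fraction of the buffer that may fill before a spill to disk kicks in.
  conf.setFloat("io.sort.spill.percent", 0.80f);
  // Run the combiner again while merging spill files only if at least this
  // many spills exist.
  conf.setInt("min.num.spills.for.combine", 3);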

However, combiner execution also depends on having a sufficient number of 
records to combine - this is because running the combiner is somewhat 
expensive, since it needs an extra serialize-deserialize pair.

Thus, the combiner may be run 0 or more times. 
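
To make that concrete, here is a minimal word-count style driver that sets a 
combiner (just a sketch along the lines of the stock example, using the 0.20 
'new' API). The important bits: the combiner's input and output types match 
the map output types, and the operation (summing) is associative and 
commutative, so the result is correct whether the framework runs it 0, 1 or 
N times:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountWithCombiner {

    // Emits <word, 1> for every token in the input line.
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Used as both combiner and reducer: sums the counts for a key.
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      public void reduce(Text key, Iterable<IntWritable> values,
          Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "word count");
      job.setJarByClass(WordCountWithCombiner.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);  // a hint, not a guarantee
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }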

> Also, is it possible to use the Combiner without the Reducer? Will the 
> framework take the Combiner into consideration in this case?


No. When the job has no reduces, the map-outputs are written straight to HDFS 
(typically) without being sorted. Thus, combiners are never in that execution 
path.
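
(For completeness, my own illustration rather than anything from the thread: 
a map-only variant of the driver above would just drop the reducer and set 
the reduce count to zero - at which point setting a combiner would be 
pointless, since there is no sort/spill phase for it to hook into.)

  // Inside main(), reusing TokenizerMapper and the imports from the sketch
  // above:
  Job job = new Job(new Configuration(), "map only tokens");
  job.setJarByClass(WordCountWithCombiner.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setNumReduceTasks(0);          // map-only: output goes straight to HDFS
  // job.setCombinerClass(...) would never be invoked on this path.
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);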

hth,
Arun
