On Jul 26, 2007, at 5:02 PM, John M Cieslewicz wrote:
The combiner semantics, however, are the same as the reducer's, and there is nothing to prevent a programmer from implementing a combiner that changes the value of the key or outputs more or fewer than one key-value pair.

The combiner and reducer share an interface. However, the semantics are different:
   1 Combiners may be invoked once or many times on each of the map
      outputs, while the reducer will be invoked exactly once on each key.
   2 As a result, combiners effectively cannot have side effects, while
      reducers can.
   3 Reducers can emit different types than their inputs; combiners
      cannot.
   4 Reducers can change the key, while combiners are required not to.
      Currently this is not checked dynamically, although it should be.
      (Things will break badly if a combiner does this...)
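To make those rules concrete, here is a minimal sketch against the org.apache.hadoop.mapred interfaces (the class name is made up, and the job wiring is omitted): it keeps the key, keeps the value type, emits exactly one record, and has no side effects, so it is safe to run any number of times on any subset of a key's values.

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // A combiner that obeys all four rules above: pure, key unchanged,
  // value type unchanged, exactly one record out per key. Because of
  // that, invoking it zero, one, or many times gives the same result.
  public class SumCombiner extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();  // merge partial counts
      }
      output.collect(key, new IntWritable(sum));  // same key, same type
    }
  }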

Note that currently Hadoop invokes the combiner exactly once. There is a jira issue filed to fix that. *smile*

This leads to a number of limitations, chief among them the fact that the combiner cannot be applied more than once because there are no guarantees regarding the effects of repeatedly using the combiner (as implemented, the
combiner could produce more than one output pair or change the key).

As I said in the previous point, the combiner can be invoked more than once and should be; currently it is not. Applications are required to keep their combiners pure. I hope it does not break too many applications when we fix this.

A summary of desirable semantics:
   1 The map function produces as output partial aggregate values
      representing singletons.
   2 A new combiner function that explicitly performs partial-to-partial
      aggregation over one or more values, creating one new output value
      of the same type as the input value and not changing the key.
   3 A reducer which takes as input partial aggregates and produces
      final values of any format.

Basically, we already have this, except that we allow the combiner to emit multiple records. Emitting multiple records from the combiner is not as clean, but in practice I don't think it hurts anything.
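As a sketch of that three-step pattern (the class names SumCount, AvgCombiner, and AvgReducer are made up for illustration, and each class would live in its own file): computing an average, where the map emits singleton (sum = value, count = 1) partials, the combiner merges partials into one partial of the same type under the same key, and only the reducer changes the type to produce the final value.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Step 1's output type: a (sum, count) partial aggregate. The map
  // would emit one of these per input value, with count = 1.
  public class SumCount implements Writable {
    public long sum;
    public long count;
    public void write(DataOutput out) throws IOException {
      out.writeLong(sum);
      out.writeLong(count);
    }
    public void readFields(DataInput in) throws IOException {
      sum = in.readLong();
      count = in.readLong();
    }
  }

  // Step 2: partial-to-partial aggregation. Same key, same value type,
  // exactly one record out, no side effects.
  public class AvgCombiner extends MapReduceBase
      implements Reducer<Text, SumCount, Text, SumCount> {
    public void reduce(Text key, Iterator<SumCount> values,
                       OutputCollector<Text, SumCount> output,
                       Reporter reporter) throws IOException {
      SumCount merged = new SumCount();
      while (values.hasNext()) {
        SumCount p = values.next();
        merged.sum += p.sum;
        merged.count += p.count;
      }
      output.collect(key, merged);
    }
  }

  // Step 3: only the reducer may change the output type -- here it
  // turns the (sum, count) partial into the final average.
  public class AvgReducer extends MapReduceBase
      implements Reducer<Text, SumCount, Text, DoubleWritable> {
    public void reduce(Text key, Iterator<SumCount> values,
                       OutputCollector<Text, DoubleWritable> output,
                       Reporter reporter) throws IOException {
      long sum = 0, count = 0;
      while (values.hasNext()) {
        SumCount p = values.next();
        sum += p.sum;
        count += p.count;
      }
      output.collect(key, new DoubleWritable((double) sum / count));
    }
  }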

This proposal requires a slightly more restrictive combiner, but with the ability to apply this new combiner function repeatedly, one can obtain some benefits, including:
   1 Rather than just combining within a mapper's output spill, one
      could repeat the process during the merge of spills, further
      reducing the amount of data to be transferred.
   2 The reducer can be more aggressively pipelined, with partial
      aggregation occurring among the finished map outputs while the
      reducer waits for later map tasks to complete. In this manner,
      some of the aggregation can be pushed into the sort and merge
      phases.

You are right that combiners also likely make sense on the reduce side, on the output of the merge. The payback is smaller because the data is not likely to be large, but for some applications it may be significant.
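To see why repeated application is safe, here is a toy illustration in plain Java (not the Hadoop API): because a pure partial-to-partial merge is associative, combining within each spill and then again at merge time gives the same answer as combining everything once.

  public class RepeatedCombine {
    // A pure, associative partial-to-partial merge (here: a sum).
    static long combine(long... partials) {
      long total = 0;
      for (long p : partials) {
        total += p;
      }
      return total;
    }

    public static void main(String[] args) {
      long spillA = combine(3, 4);            // combined within spill A
      long spillB = combine(5, 2);            // combined within spill B
      long merged = combine(spillA, spillB);  // combined again at merge time
      System.out.println(merged == combine(3, 4, 5, 2));  // true: 14 == 14
    }
  }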

Do note, however, that combiners are not free. They force the objects to be serialized and deserialized an extra time, and they cost their own execution time. In general, if the user has asked for them, they will reduce the data, but not always.

-- Owen

