On Jul 26, 2007, at 5:02 PM, John M Cieslewicz wrote:
The combiner semantics, however, are the same as the reducer’s and there is nothing to prevent a programmer from implementing a combiner that changes the value of the key or outputs more or less than one key-value pair.
The combiner and reducer share an interface. However, the semantics are different (a combiner that obeys these rules is sketched below). In particular,
1. Combiners may be invoked once or many times on each of the map outputs, while reduces will be invoked exactly once on each key.
2. As a result of that, combiners effectively cannot have side effects, while reduces can.
3. Reduces can emit different types than their inputs; combiners cannot.
4. Reduces can change the key, while combiners are required not to. Currently this is not checked dynamically, although it should be. (Things will break badly if combiners do this...)
Note that currently Hadoop invokes the combiner exactly once. There
is a jira issue filed to fix that. *smile*
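For concreteness, a combiner that obeys all four rules might look like the following minimal sketch. It uses the generics form of the org.apache.hadoop.mapred API; SumCombiner is a hypothetical name, and details may differ slightly from the exact interfaces of this era.

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Obeys all four rules: keeps the key, emits exactly one value of the
  // same type as its inputs, and has no side effects, so the framework is
  // free to run it zero, one, or many times.
  public class SumCombiner extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      // same key in, same key out; many IntWritables collapse to one
      output.collect(key, new IntWritable(sum));
    }
  }

Note that nothing in the shared interface enforces rules 3 and 4; that is the missing dynamic check mentioned in point 4.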
This leads to a number of limitations, chief among them the fact that the combiner cannot be applied more than once because there are no guarantees regarding the effects of repeatedly using the combiner (as implemented, the combiner could produce more than one output pair or change the key).
As I said in the previous point, the combiner can be invoked more than once and should be; currently it is not. Applications are required to keep their combiners pure. I hope it does not break too many applications when we fix this.
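To illustrate the purity requirement, consider computing a mean (a toy example in plain Java, not a Hadoop job; all names here are made up). Averaging partial averages is not safe under repeated combining, but carrying a (sum, count) pair as the partial aggregate is associative, so the combine step can run any number of times:

  public class AverageCombining {

    // The partial aggregate that flows between map, combine, and reduce.
    static final class SumCount {
      final long sum;
      final long count;
      SumCount(long sum, long count) { this.sum = sum; this.count = count; }
    }

    // A safe combine step: associative, commutative, and side-effect free,
    // so applying it once or many times gives the same answer.
    static SumCount combine(SumCount a, SumCount b) {
      return new SumCount(a.sum + b.sum, a.count + b.count);
    }

    // The final reduce may change the output type (here, to double).
    static double finish(SumCount total) {
      return (double) total.sum / total.count;
    }

    public static void main(String[] args) {
      SumCount g1 = combine(combine(new SumCount(1, 1), new SumCount(2, 1)),
                            new SumCount(3, 1));    // the values {1, 2, 3}
      SumCount g2 = new SumCount(4, 1);              // the value {4}
      System.out.println(finish(combine(g1, g2)));  // 2.5, the true mean
      // Averaging the partial averages instead would give
      // avg(avg(1,2,3), avg(4)) = avg(2.0, 4.0) = 3.0, which is wrong.
    }
  }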
A summary of desirable semantics (see the signature sketch below):
1 The map function produces as output partial aggregate values representing singletons.
2 A new combiner function that explicitly performs partial to partial aggregation over one or more values, creating one new output value of the same type as the input value and not changing the key.
3 A reducer which takes as input partial aggregates and produces final values of any format.
Basically, we already have this, except that we allow the combiner to
emit multiple records. Multiple records out of the combiner is not as
clean, but in practice I don't think it hurts anything.
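For concreteness, the proposed three-stage contract could be written as signatures like these (a hypothetical sketch, not the actual Hadoop interfaces; K is the key type, V the raw value, P the partial aggregate, F the final output):

  import java.util.Iterator;

  // Hypothetical signatures for the proposal above; not a Hadoop API.
  interface Aggregation<K, V, P, F> {
    P singleton(K key, V value);             // 1: map emits one-element partials
    P combine(K key, Iterator<P> partials);  // 2: partials -> one partial, same key
    F reduce(K key, Iterator<P> partials);   // 3: partials -> final value, any type
  }

Under these signatures, the only delta from today's Reducer-shaped combiner is that combine returns exactly one value instead of writing any number of records to an OutputCollector.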
This proposal requires a slightly more restrictive combiner, but with the ability to apply this new combiner function repeatedly, one can obtain some benefits, including:
1 Rather than just combining within a mapper’s output spill, one could repeat the process during the merge of spills, further reducing the amount of data to be transferred (see the sketch below).
2 The reducer can be more aggressively pipelined, with partial aggregation occurring among the finished map outputs while the reducer waits for later map tasks to complete. In this manner, some of the aggregation can be pushed into the sort and merge phases.
You are right that combiners on the reduce side also likely make
sense on the output of the merge. The payback is less because the
data isn't likely to be large, but for some applications, it may be
significant.
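Here is a toy sketch of benefit 1 in plain Java (not the actual merge code; the names are made up). Each spill run is sorted and already locally combined, and merging the runs brings equal keys back together, so a pure combiner, an integer sum here, can be re-applied to shrink the data before it crosses the network:

  import java.util.Arrays;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;

  public class CombiningMerge {

    // Merge sorted, locally-combined runs, re-applying the combine step
    // (integer sum) whenever the same key appears in more than one run.
    static TreeMap<String, Integer> mergeAndCombine(
        List<TreeMap<String, Integer>> runs) {
      TreeMap<String, Integer> merged = new TreeMap<String, Integer>();
      for (TreeMap<String, Integer> run : runs) {
        for (Map.Entry<String, Integer> e : run.entrySet()) {
          Integer prev = merged.get(e.getKey());
          // same key, same value type, one output value per key
          merged.put(e.getKey(),
                     prev == null ? e.getValue() : prev + e.getValue());
        }
      }
      return merged;
    }

    public static void main(String[] args) {
      TreeMap<String, Integer> run1 = new TreeMap<String, Integer>();
      run1.put("a", 2); run1.put("b", 1);
      TreeMap<String, Integer> run2 = new TreeMap<String, Integer>();
      run2.put("a", 3); run2.put("c", 4);
      // prints {a=5, b=1, c=4}: the two "a" records collapse to one
      System.out.println(mergeAndCombine(Arrays.asList(run1, run2)));
    }
  }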
Do note, however, that combiners are not free. They force the objects to be serialized and deserialized an extra time, and they add their own execution time. In general, if the user has asked for them they will reduce the data, but not always.
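For reference, this is how a job asks for a combiner in the mapred API (WordCount and SumCombiner stand in for the hypothetical classes sketched above); if the combiner rarely shrinks the data, the extra serialization and execution time can outweigh the win:

  JobConf conf = new JobConf(WordCount.class);
  conf.setCombinerClass(SumCombiner.class);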
-- Owen