On Jul 26, 2007, at 5:02 PM, John M Cieslewicz wrote:
The combiner semantics, however, are the same as the reducer’s and there is nothing to prevent a programmer from implementing a combiner that changes the value of the key or outputs more or less than one key-value pair.
The combiner and reducer share an interface. However, the semantics are different (a combiner that obeys these rules is sketched below). In particular,
1. Combiners may be invoked once or many times on each of the map outputs, while reduces will be invoked exactly once on each key.
2. As a result of that, combiners effectively cannot have side effects, while reduces can.
3. Reduces can emit different types than their inputs; combiners cannot.
4. Reduces can change the key, while combiners are required not to. Currently this is not checked dynamically, although it should be. (Things will break badly if combiners do this...)
Note that currently Hadoop invokes the combiner exactly once. There
is a jira issue filed to fix that. *smile*
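For concreteness, a combiner that obeys all four rules might look like the following minimal sketch. It uses the generics form of the org.apache.hadoop.mapred API; SumCombiner is a hypothetical name, and details may differ slightly from the exact interfaces of this era.

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Obeys all four rules: keeps the key, emits exactly one value of the
  // same type as its inputs, and has no side effects, so the framework is
  // free to run it zero, one, or many times.
  public class SumCombiner extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      // same key in, same key out; many IntWritables collapse to one
      output.collect(key, new IntWritable(sum));
    }
  }

Note that nothing in the shared interface enforces rules 3 and 4; that is the missing dynamic check mentioned in point 4.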
This leads to a number of limitations, chief among them the fact that the combiner cannot be applied more than once because there are no guarantees regarding the effects of repeatedly using the combiner (as implemented, the combiner could produce more than one output pair or change the key).
As I said in the previous point, the combiner can be invoked more than once and should be; currently it is not. Applications are required to keep their combiners pure. I hope it does not break too many applications when we fix this.
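To illustrate the purity requirement, consider computing a mean (a toy example in plain Java, not a Hadoop job; all names here are made up). Averaging partial averages is not safe under repeated combining, but carrying a (sum, count) pair as the partial aggregate is associative, so the combine step can run any number of times:

  public class AverageCombining {

    // The partial aggregate that flows between map, combine, and reduce.
    static final class SumCount {
      final long sum;
      final long count;
      SumCount(long sum, long count) { this.sum = sum; this.count = count; }
    }

    // A safe combine step: associative, commutative, and side-effect free,
    // so applying it once or many times gives the same answer.
    static SumCount combine(SumCount a, SumCount b) {
      return new SumCount(a.sum + b.sum, a.count + b.count);
    }

    // The final reduce may change the output type (here, to double).
    static double finish(SumCount total) {
      return (double) total.sum / total.count;
    }

    public static void main(String[] args) {
      SumCount g1 = combine(combine(new SumCount(1, 1), new SumCount(2, 1)),
                            new SumCount(3, 1));    // the values {1, 2, 3}
      SumCount g2 = new SumCount(4, 1);              // the value {4}
      System.out.println(finish(combine(g1, g2)));  // 2.5, the true mean
      // Averaging the partial averages instead would give
      // avg(avg(1,2,3), avg(4)) = avg(2.0, 4.0) = 3.0, which is wrong.
    }
  }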
A summary of desirable semantics (see the signature sketch below):
1 The map function produces as output partial aggregate values representing singletons.
2 A new combiner function that explicitly performs partial to partial aggregation over one or more values, creating one new output value of the same type as the input value and not changing the key.
3 A reducer which takes as input partial aggregates and produces final values of any format.
Basically, we already have this, except that we allow the combiner to
emit multiple records. Multiple records out of the combiner is not as
clean, but in practice I don't think it hurts anything.
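For concreteness, the proposed three-stage contract could be written as signatures like these (a hypothetical sketch, not the actual Hadoop interfaces; K is the key type, V the raw value, P the partial aggregate, F the final output):

  import java.util.Iterator;

  // Hypothetical signatures for the proposal above; not a Hadoop API.
  interface Aggregation<K, V, P, F> {
    P singleton(K key, V value);             // 1: map emits one-element partials
    P combine(K key, Iterator<P> partials);  // 2: partials -> one partial, same key
    F reduce(K key, Iterator<P> partials);   // 3: partials -> final value, any type
  }

Under these signatures, the only delta from today's Reducer-shaped combiner is that combine returns exactly one value instead of writing any number of records to an OutputCollector.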
This proposal requires a slightly more restrictive combiner, but with the ability to apply this new combiner function repeatedly, one can obtain some benefits, including:
1 Rather than just combining within a mapper’s output spill, one could repeat the process during the merge of spills, further reducing the amount of data to be transferred (see the sketch below).
2 The reducer can be more aggressively pipelined, with partial aggregation occurring among the finished map outputs while the reducer waits for later map tasks to complete. In this manner, some of the aggregation can be pushed into the sort and merge phases.
You are right that combiners on the reduce side also likely make
sense on the output of the merge. The payback is less because the
data isn't likely to be large, but for some applications, it may be
significant.
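Here is a toy sketch of benefit 1 in plain Java (not the actual merge code; the names are made up). Each spill run is sorted and already locally combined, and merging the runs brings equal keys back together, so a pure combiner, an integer sum here, can be re-applied to shrink the data before it crosses the network:

  import java.util.Arrays;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;

  public class CombiningMerge {

    // Merge sorted, locally-combined runs, re-applying the combine step
    // (integer sum) whenever the same key appears in more than one run.
    static TreeMap<String, Integer> mergeAndCombine(
        List<TreeMap<String, Integer>> runs) {
      TreeMap<String, Integer> merged = new TreeMap<String, Integer>();
      for (TreeMap<String, Integer> run : runs) {
        for (Map.Entry<String, Integer> e : run.entrySet()) {
          Integer prev = merged.get(e.getKey());
          // same key, same value type, one output value per key
          merged.put(e.getKey(),
                     prev == null ? e.getValue() : prev + e.getValue());
        }
      }
      return merged;
    }

    public static void main(String[] args) {
      TreeMap<String, Integer> run1 = new TreeMap<String, Integer>();
      run1.put("a", 2); run1.put("b", 1);
      TreeMap<String, Integer> run2 = new TreeMap<String, Integer>();
      run2.put("a", 3); run2.put("c", 4);
      // prints {a=5, b=1, c=4}: the two "a" records collapse to one
      System.out.println(mergeAndCombine(Arrays.asList(run1, run2)));
    }
  }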
Do note, however, that combiners are not free. They force the objects to be serialized and deserialized an extra time, and they add their own execution time. In general, if the user has asked for them they will reduce the data, but not always.
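For reference, this is how a job asks for a combiner in the mapred API (WordCount and SumCombiner stand in for the hypothetical classes sketched above); if the combiner rarely shrinks the data, the extra serialization and execution time can outweigh the win:

  JobConf conf = new JobConf(WordCount.class);
  conf.setCombinerClass(SumCombiner.class);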
-- Owen