The argument for using local combiners is interesting. To me, the
combiner class is just another layer of transformation. It does not
follow that the combiner class has to be the same as the reducer class.
The only criterion is that the two satisfy the following rule (a form
of associativity):
Let L1, L2, ..., Ln and K1, K2, ..., Km be two partitions of S. Then
Reduce(list(Combiner(L1), Combiner(L2), ..., Combiner(Ln))) and
Reduce(list(Combiner(K1), Combiner(K2), ..., Combiner(Km))) are the
same.
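To make the rule concrete, here is a minimal plain-Java sketch (not
the Hadoop API; combine and reduce are illustrative names) using word
count over S = [1, 1, 1, 1]: the combiner collapses each sublist of
ones to a partial sum, the reducer sums the partial sums, and any
partition of S yields the same total.

    public class PartitionInvariance {
        // Combiner(Li): collapse one sublist of ones to a partial sum.
        static int combine(int[] part) {
            int sum = 0;
            for (int v : part) sum += v;
            return sum;
        }

        // Reduce over the list of combiner outputs: sum the partials.
        static int reduce(int[] partials) {
            int sum = 0;
            for (int p : partials) sum += p;
            return sum;
        }

        public static void main(String[] args) {
            // Two different partitions of S = [1, 1, 1, 1].
            int a = reduce(new int[] { combine(new int[] {1, 1}),
                                       combine(new int[] {1, 1}) });
            int b = reduce(new int[] { combine(new int[] {1}),
                                       combine(new int[] {1, 1, 1}) });
            System.out.println(a == b);  // true: both equal 4
        }
    }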
A special (and probably very common) scenario is that the combiner and
the reducer are the same class and the reduce function is associative.
However, this need not be the case in general. And when the combiner
and the reducer are not the same class, the class of the reduce
outputs need not be the same as that of the combiner outputs.
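As one sketch of that more general case (class names hypothetical, not
the Hadoop API): computing a mean. The combiner collapses raw values
into (sum, count) partials, and the reducer merges the partials and
emits a double. So the combiner and reducer are different classes, and
the class of the reduce outputs differs from that of the combiner
outputs, yet the partition rule above still holds because merging
partials is associative.

    import java.util.List;

    // Partial is the combiner's output class; the reducer emits double.
    class Partial {
        final long sum;
        final long count;
        Partial(long sum, long count) { this.sum = sum; this.count = count; }
    }

    class MeanCombiner {
        // Combiner(Li): collapse one sublist of raw values to a partial.
        Partial combine(List<Long> values) {
            long s = 0;
            for (long v : values) s += v;
            return new Partial(s, values.size());
        }
    }

    class MeanReducer {
        // Reduce over the combiner outputs: merge partials, emit mean.
        double reduce(List<Partial> partials) {
            long s = 0, c = 0;
            for (Partial p : partials) { s += p.sum; c += p.count; }
            return (double) s / c;
        }
    }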
Runping
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 02, 2006 1:30 PM
To: [email protected]
Subject: Re: [jira] Commented: (HADOOP-115) Hadoop should allow the user to
use SequentialFileOutputformat as the output format and to choose key/value
classes that are different from those for map output.
Eric Baldeschwieler wrote:
> I cannot think of a case where this proposed extension complicates
> code or reduces compressibility. Since it is backwards compatible with
> your desired API, purists can simply ignore the option.
It makes the insertion of a combiner no longer transparent. The reducer
would have to know whether a combiner had been used in order to know how
to process the map output.
In general this seems like a micro-optimization. It saves little code.
Instead of writing 'collector.collect(key, new List(value))' one could
write 'collector.collect(key, value)'.
Taking this to its logical extreme, in the classic word-count use of
MapReduce, why should one have to emit ones for the map values? Why
have a value at all? Why not add a collect(key) method, then permit
reducers to be passed an iterator which returns null for all values
where collect(key) was called. That would save a little code and make
the intermediate data a bit smaller. So should we do it? I'd argue not.
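For reference, a minimal sketch of the classic word-count map being
alluded to (plain Java with an illustrative Collector interface, not
the Hadoop API): the map emits a constant 1 per word, the reducer sums
the ones, and the hypothetical collect(key) would merely drop that 1.

    interface Collector {
        void collect(String key, int value);
    }

    class WordCountMap {
        // Emit a one for every word; the reducer then sums the ones.
        void map(String line, Collector out) {
            for (String word : line.split("\\s+")) {
                out.collect(word, 1);
            }
        }
    }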
Doug