On Feb 10, 2009, at 5:20 PM, Matei Zaharia wrote:
I'd like to write a combiner that shares a lot of code with a reducer,
except that the reducer updates an external database at the end.
The right way to do this is to either do the update in the output
format or do something like:
class MyCombiner implements Reducer {
  ...
  public void close() throws IOException {}
}

class MyReducer extends MyCombiner {
  ...
  public void close() throws IOException { ... update database ... }
}
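Filled in, the pattern looks like the sketch below. It is dependency-free (a stand-in interface rather than Hadoop's real Reducer, and a list standing in for the external database, both invented here for illustration): the combiner's close() is a no-op, and the reducer subclass overrides it to do its end-of-task work.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in for Hadoop's old Reducer interface (illustrative only).
interface SimpleReduceTask {
    void reduce(String key, List<Integer> values) throws IOException;
    void close() throws IOException;
}

class WordCountCombiner implements SimpleReduceTask {
    public void reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        emit(key, sum);
    }
    protected void emit(String key, int sum) { /* collect output */ }
    public void close() { }   // combiner: nothing to finalize
}

class WordCountReducer extends WordCountCombiner {
    final List<String> databaseStub = new ArrayList<>();  // pretend external DB
    @Override
    public void close() {
        databaseStub.add("flushed");  // reducer-only side effect at end of task
    }
}
```

All of the reduce logic lives in one place; only the end-of-task behavior differs between the two classes.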
As far as I can tell, since both combiners and reducers must implement the Reducer interface, there is no way to have this be the same class.
There are ways to do it, but they are likely to change.
Is there a recommended way to test inside the task whether you're running as a combiner (in a map task) or as a reducer?
The question is worse than you think. In particular, the question is *not* whether you are in a map or reduce task. With current versions of Hadoop, the combiner can be called in the context of the reduce as well as the map, so what you really want to know is whether you are in a Reducer or Combiner context.
If not, I think this might be an interesting thing to support in the Hadoop 1.0 API.
It probably does make sense to add a ReduceContext.isCombiner() method to answer the question. In practice, when someone wants to use *almost* the same code for the combiner and the reducer, I get suspicious of their design.
It would enable people to write an AbstractJob class where you just implement map, combine and reduce functions, and can thus write MapReduce jobs in a single Java class.
The old api allowed this, since both Mapper and Reducer were
interfaces. The new api doesn't because they are both classes. It
wouldn't be hard to make a set of adaptors in library code that would
work. Basically, you would define a job with SimpleMapper,
SimpleCombiner, and SimpleReducer that would call Task.map,
Task.combine, and Task.reduce.
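A minimal, Hadoop-free sketch of that adaptor idea: one abstract class with map, combine, and reduce methods, plus thin adaptor classes that delegate to it. The class names follow the email (AbstractJob, SimpleMapper, SimpleReducer), but the signatures here are invented for illustration, not a real Hadoop API.

```java
import java.util.List;
import java.util.function.BiConsumer;

// The whole job is defined by implementing three methods in one class.
abstract class AbstractJob {
    abstract void map(String key, String value, BiConsumer<String, Integer> out);
    abstract void combine(String key, List<Integer> values, BiConsumer<String, Integer> out);
    abstract void reduce(String key, List<Integer> values, BiConsumer<String, Integer> out);
}

// Adaptors: each wraps the job and exposes exactly one of the three
// calls, the way SimpleMapper / SimpleCombiner / SimpleReducer would
// delegate to Task.map, Task.combine, and Task.reduce.
class SimpleMapper {
    private final AbstractJob job;
    SimpleMapper(AbstractJob job) { this.job = job; }
    void run(String key, String value, BiConsumer<String, Integer> out) {
        job.map(key, value, out);
    }
}

class SimpleReducer {
    private final AbstractJob job;
    SimpleReducer(AbstractJob job) { this.job = job; }
    void run(String key, List<Integer> values, BiConsumer<String, Integer> out) {
        job.reduce(key, values, out);
    }
}

// A word-count job then fits in a single class:
class WordCount extends AbstractJob {
    void map(String key, String line, BiConsumer<String, Integer> out) {
        for (String w : line.split("\\s+")) out.accept(w, 1);
    }
    void combine(String key, List<Integer> values, BiConsumer<String, Integer> out) {
        reduce(key, values, out);   // combine and reduce share the same sum
    }
    void reduce(String key, List<Integer> values, BiConsumer<String, Integer> out) {
        int sum = 0;
        for (int v : values) sum += v;
        out.accept(key, sum);
    }
}
```

The framework would instantiate the adaptors itself from the job class; the user only ever writes the one AbstractJob subclass.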
-- Owen