Craig Macdonald created MAPREDUCE-4776:
------------------------------------------
Summary: Reducer Channels
Key: MAPREDUCE-4776
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4776
Project: Hadoop Map/Reduce
Issue Type: New Feature
Reporter: Craig Macdonald
A 2009 Google paper on LDA, available at
http://plda.googlecode.com/files/aaim.pdf, describes what it terms "reducer
channels". These are similar to MultipleOutputs, except that the collect() call
in the map task names a set of reducers, and the key/value pairs are forwarded
to that set of reducers. This also implies a separate combiner and partitioner
for each reducer channel.
It strikes me that while the same effect may be achievable in Hadoop by using
special keys, this formulation may be more natural. It would better support
data operations where several passes over large data could be condensed into a
single map phase with multiple sets of reducers, resulting in fewer mapping
jobs.
(For instance, see Figure 2 of the paper, where there are two channels: one for
data, one for the model.)
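
To make the "special keys" workaround concrete, here is a minimal sketch in
plain Java. It does not use any Hadoop API: the class ChannelRouting and its
methods are hypothetical names chosen for illustration, and the channel names
"data" and "model" simply mirror the two channels in Figure 2. It assumes each
channel is given its own contiguous block of reducer indices, and that a key
tagged with a channel is hashed only within that channel's block.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: emulating reducer channels by tagging keys with a
// channel name and partitioning within a per-channel block of reducers.
// None of these names are Hadoop APIs.
public class ChannelRouting {
    // channel name -> {first reducer index, number of reducers in the block}
    private final Map<String, int[]> blocks = new LinkedHashMap<>();
    private int totalReducers = 0;

    // Register a channel and reserve the next `reducers` reducer indices for it.
    public ChannelRouting addChannel(String name, int reducers) {
        if (reducers <= 0) {
            throw new IllegalArgumentException("reducers must be positive");
        }
        if (blocks.containsKey(name)) {
            throw new IllegalArgumentException("duplicate channel: " + name);
        }
        blocks.put(name, new int[] {totalReducers, reducers});
        totalReducers += reducers;
        return this;
    }

    // Total number of reducers across all channels.
    public int totalReducers() {
        return totalReducers;
    }

    // Map a (channel, key) pair to a global reducer index that always falls
    // inside the block reserved for that channel.
    public int partition(String channel, String key) {
        int[] block = blocks.get(channel);
        if (block == null) {
            throw new IllegalArgumentException("unknown channel: " + channel);
        }
        // Mask the sign bit so the modulo result is never negative.
        int withinChannel = (key.hashCode() & Integer.MAX_VALUE) % block[1];
        return block[0] + withinChannel;
    }

    public static void main(String[] args) {
        ChannelRouting routing = new ChannelRouting()
                .addChannel("data", 4)
                .addChannel("model", 2);
        System.out.println("data  -> reducer " + routing.partition("data", "doc-17"));
        System.out.println("model -> reducer " + routing.partition("model", "word-42"));
    }
}
```

In a real job the same logic would have to live in a custom partitioner, with
the reduce and combine code branching on the channel tag. That bookkeeping is
what first-class reducer channels would remove.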
I note the following from the MultipleOutputs documentation: "When named
outputs are used within a Mapper implementation, key/values written to a name
output are not part of the reduce phase, only key/values written to the job
OutputCollector are part of the reduce phase."
The proposed change would address this limitation of MultipleOutputs.