Craig Macdonald created MAPREDUCE-4776:
------------------------------------------

             Summary: Reducer Channels
                 Key: MAPREDUCE-4776
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4776
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
            Reporter: Craig Macdonald


A 2009 Google paper on LDA, available at 
http://plda.googlecode.com/files/aaim.pdf, describes what it terms "reducer 
channels". These are similar to MultipleOutputs, except that the collect() call 
in the map task names a set of reducers, and the key/value pairs are forwarded 
to that set of reducers. This also implies separate combiners and partitioning 
for each reducer channel. 

It strikes me that while the same effect may be achievable in Hadoop by using 
special keys, this formulation may be more natural. It would better facilitate 
data operations where several passes over large data could be condensed into a 
single map phase with multiple sets of reducers, resulting in fewer map jobs.
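
To make the "special keys" workaround concrete, here is a minimal sketch of 
what I have in mind. It deliberately avoids the Hadoop API and only models the 
routing logic in plain Java; the names ChannelKeySketch, Channel, tag() and 
partition() are hypothetical, and the tab separator assumes channel names never 
contain a tab. Each key is prefixed with its channel name, and a partitioner 
maps each channel onto its own disjoint range of reducer indices.

```java
import java.util.Arrays;
import java.util.List;

class ChannelKeySketch {
    /** Hypothetical channel: a name plus the contiguous reducer range it owns. */
    static final class Channel {
        final String name;
        final int firstReducer;
        final int numReducers;

        Channel(String name, int firstReducer, int numReducers) {
            this.name = name;
            this.firstReducer = firstReducer;
            this.numReducers = numReducers;
        }
    }

    /** Builds the "special key": channel name, a tab, then the real key. */
    static String tag(String channel, String key) {
        return channel + "\t" + key;
    }

    /** Routes a tagged key to a reducer inside its channel's range. */
    static int partition(String taggedKey, List<Channel> channels) {
        int sep = taggedKey.indexOf('\t');
        if (sep < 0) {
            throw new IllegalArgumentException("key has no channel tag: " + taggedKey);
        }
        String channelName = taggedKey.substring(0, sep);
        String key = taggedKey.substring(sep + 1);
        for (Channel c : channels) {
            if (c.name.equals(channelName)) {
                // Mask the sign bit so the modulo result is never negative.
                int bucket = (key.hashCode() & Integer.MAX_VALUE) % c.numReducers;
                return c.firstReducer + bucket;
            }
        }
        throw new IllegalArgumentException("unknown channel: " + channelName);
    }

    public static void main(String[] args) {
        // Two channels, as in the paper's example: one for data, one for the model.
        List<Channel> channels = Arrays.asList(
                new Channel("data", 0, 4),
                new Channel("model", 4, 2));
        System.out.println(partition(tag("data", "doc42"), channels));
        System.out.println(partition(tag("model", "word7"), channels));
    }
}
```

Even in this toy form, the reduce side would still have to strip the tag and 
dispatch on the channel name itself, and a single combiner would have to handle 
every channel. That bookkeeping is what first-class reducer channels would 
remove.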

(For instance, see Figure 2 of the paper, where there are two channels: one for 
data, one for the model.)

I note the following from the documentation of MultipleOutputs: "When named outputs are 
used within a Mapper implementation, key/values written to a name output are 
not part of the reduce phase, only key/values written to the job 
OutputCollector are part of the reduce phase."

The proposed change would address this limitation of MultipleOutputs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
