Craig Macdonald created MAPREDUCE-4776:
------------------------------------------
             Summary: Reducer Channels
                 Key: MAPREDUCE-4776
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4776
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
            Reporter: Craig Macdonald

A Google paper on LDA from 2009 (available at http://plda.googlecode.com/files/aaim.pdf) describes what it terms "reducer channels". This is similar to MultipleOutputs, except that the collect() call in the map task names a set of reducers, and the key/value pairs are forwarded to that set of reducers. This also implies separate combiners and partitioning for each reducer channel.

It strikes me that while the same effect may be achievable in Hadoop by using special keys, this formulation may be more natural. It would better facilitate data operations where several passes over large data could be condensed into a single map phase with multiple sets of reducers, resulting in fewer map passes. (For instance, see Figure 2 of the paper, where there are two channels: one for the data, one for the model.)

I note the following from the documentation of MultipleOutputs: "When named outputs are used within a Mapper implementation, key/values written to a name output are not part of the reduce phase, only key/values written to the job OutputCollector are part of the reduce phase." The proposed change would address this limitation of MultipleOutputs.
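
For illustration, the "special keys" workaround mentioned in the description could be sketched roughly as follows. This is a plain-Java model with no Hadoop dependency, not an existing API: the class ChannelPartitioner and its methods addChannel, getTotalReducers and getPartition are hypothetical names chosen for this sketch. It assumes each channel is given its own contiguous block of reducer indices and that keys are hashed only within their channel's block.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical model of reducer channels emulated with channel-tagged keys.
 * Each channel owns a disjoint, contiguous block of reducer indices.
 */
class ChannelPartitioner {
    // channel name -> {index of first reducer, number of reducers in the block}
    private final Map<String, int[]> ranges = new LinkedHashMap<>();
    private int totalReducers = 0;

    /** Reserve the next numReducers reducer indices for the named channel. */
    ChannelPartitioner addChannel(String name, int numReducers) {
        if (numReducers <= 0) {
            throw new IllegalArgumentException("numReducers must be positive");
        }
        if (ranges.containsKey(name)) {
            throw new IllegalArgumentException("duplicate channel: " + name);
        }
        ranges.put(name, new int[] {totalReducers, numReducers});
        totalReducers += numReducers;
        return this;
    }

    /** Total number of reducers across all channels. */
    int getTotalReducers() {
        return totalReducers;
    }

    /** Map a (channel, key) pair to a reducer index inside that channel's block. */
    int getPartition(String channel, String key) {
        int[] range = ranges.get(channel);
        if (range == null) {
            throw new IllegalArgumentException("unknown channel: " + channel);
        }
        // Mask the sign bit so the modulo result is never negative.
        int offset = (key.hashCode() & Integer.MAX_VALUE) % range[1];
        return range[0] + offset;
    }

    public static void main(String[] args) {
        // Two channels, as in Figure 2 of the paper: one for data, one for the model.
        ChannelPartitioner p = new ChannelPartitioner()
                .addChannel("data", 4)
                .addChannel("model", 2);
        System.out.println("data/doc-17   -> reducer " + p.getPartition("data", "doc-17"));
        System.out.println("model/word-42 -> reducer " + p.getPartition("model", "word-42"));
    }
}
```

In an actual job, equivalent logic would presumably live in a custom Partitioner whose getPartition(key, value, numPartitions) reads the channel tag from the key, and the reducer would have to branch on that tag itself. The per-channel combiners and reducer classes that the proposal implies are not covered by this sketch.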