[ https://issues.apache.org/jira/browse/MAPREDUCE-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452299#comment-13452299 ]
Karthik Kambatla commented on MAPREDUCE-199: -------------------------------------------- This might not be the use case Harsh was thinking of, but here is a use case from my summer internship a couple of years ago: Our use case: We were building a topic-based pub/sub system. The published events were in one HBase table, and the subscriptions were in another table. While the published events were stored by their published time-stamp, the subscriptions were stored by <Topic ID: Subscription ID> as the key. Matching the published events to subscriptions required a join of the two tables on the topic. Approach: The map phase reads all the published events and emits (topic, event) pairs. The reduce's input essentially is all events for a topic - the reduce reads all the subscriptions of that topic and matches. Now, it would save a lot of communication if the reduce (for topic A) were scheduled on the same node that had the subscriptions for the same topic A. Hence, the need for reduce data-locality. We achieved this data locality through ugly hacks to the JT to store HBase region (key-range): host mapping and overloading the partitioner to push each <key, value> pair to appropriate reducers. I don't remember the exact speedups, but it was quite significant. (if my memory is not wrong ~2x) > Locality hints for Reduce > ------------------------- > > Key: MAPREDUCE-199 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-199 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: applicationmaster, mrv2 > Reporter: Benjamin Reed > Assignee: Harsh J > Attachments: MAPREDUCE-199.patch, MAPREDUCE-199.patch > > > It would be nice if we could add method to OutputFormat that would allow a > job to indicate where a reducer for a given partition should should run. This > is similar to the getSplits() method on InputFormat. In our application the > reducer is using other data in addition to the map outputs during processing > and data accesses could be made more efficient if the JobTracker scheduled > the reducers to run on specific hosts. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira