[
https://issues.apache.org/jira/browse/MAPREDUCE-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452299#comment-13452299
]
Karthik Kambatla commented on MAPREDUCE-199:
--------------------------------------------
This might not be the use case Harsh was thinking of, but here is a use case
from my summer internship a couple of years ago:
Our use case: We were building a topic-based pub/sub system. The published
events were in one HBase table, and the subscriptions were in another table.
While the published events were stored by their published time-stamp, the
subscriptions were stored by <Topic ID: Subscription ID> as the key. Matching
the published events to subscriptions required a join of the two tables on the
topic.
Approach: The map phase reads all the published events and emits (topic, event)
pairs. The reduce's input essentially is all events for a topic - the reduce
reads all the subscriptions of that topic and matches. Now, it would save a lot
of communication if the reduce (for topic A) were scheduled on the same node
that had the subscriptions for the same topic A. Hence, the need for reduce
data-locality.
We achieved this data locality through ugly hacks to the JT to store HBase
region (key-range): host mapping and overloading the partitioner to push each
<key, value> pair to appropriate reducers. I don't remember the exact speedups,
but it was quite significant. (if my memory is not wrong ~2x)
> Locality hints for Reduce
> -------------------------
>
> Key: MAPREDUCE-199
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-199
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: applicationmaster, mrv2
> Reporter: Benjamin Reed
> Assignee: Harsh J
> Attachments: MAPREDUCE-199.patch, MAPREDUCE-199.patch
>
>
> It would be nice if we could add method to OutputFormat that would allow a
> job to indicate where a reducer for a given partition should should run. This
> is similar to the getSplits() method on InputFormat. In our application the
> reducer is using other data in addition to the map outputs during processing
> and data accesses could be made more efficient if the JobTracker scheduled
> the reducers to run on specific hosts.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira