[ 
https://issues.apache.org/jira/browse/MAPREDUCE-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452299#comment-13452299
 ] 

Karthik Kambatla commented on MAPREDUCE-199:
--------------------------------------------

This might not be the use case Harsh was thinking of, but here is a use case 
from my summer internship a couple of years ago:

Our use case: We were building a topic-based pub/sub system. The published 
events were in one HBase table, and the subscriptions were in another table. 
While the published events were stored by their published time-stamp, the 
subscriptions were stored by <Topic ID: Subscription ID> as the key. Matching 
the published events to subscriptions required a join of the two tables on the 
topic.

Approach: The map phase reads all the published events and emits (topic, event) 
pairs. The reduce's input essentially is all events for a topic - the reduce 
reads all the subscriptions of that topic and matches. Now, it would save a lot 
of communication if the reduce (for topic A) were scheduled on the same node 
that had the subscriptions for the same topic A. Hence, the need for reduce  
data-locality.

We achieved this data locality through ugly hacks to the JT to store HBase 
region (key-range): host mapping and overloading the partitioner to push each 
<key, value> pair to appropriate reducers. I don't remember the exact speedups, 
but it was quite significant. (if my memory is not wrong ~2x) 
                
> Locality hints for Reduce
> -------------------------
>
>                 Key: MAPREDUCE-199
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-199
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: applicationmaster, mrv2
>            Reporter: Benjamin Reed
>            Assignee: Harsh J
>         Attachments: MAPREDUCE-199.patch, MAPREDUCE-199.patch
>
>
> It would be nice if we could add method to OutputFormat that would allow a 
> job to indicate where a reducer for a given partition should should run. This 
> is similar to the getSplits() method on InputFormat. In our application the 
> reducer is using other data in addition to the map outputs during processing 
> and data accesses could be made more efficient if the JobTracker scheduled 
> the reducers to run on specific hosts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to