[ 
https://issues.apache.org/jira/browse/MAPREDUCE-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452293#comment-13452293
 ] 

Harsh J commented on MAPREDUCE-199:
-----------------------------------

bq. Harsh - I'm not familiar with the HBase case; can you please add more 
colour?

Surely!

bq. In this case, won't it be sufficient to schedule maps on the RS? If the 
data is already sorted, but would you try schedule reduces instead?

We have this concept of bulkloads in HBase, for example, where the Maps read 
in data from a raw source (such as a delimited text file) and pass it to the 
reducers (partitioned by TotalOrderPartitioner based on the region distribution 
of the table in HBase). The sorted data is then written to a file on HDFS and 
later injected into the /hbase directory structure for serving.
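
To make the partitioning step concrete, here is a rough sketch of the idea 
behind it. This is not Hadoop's TotalOrderPartitioner and uses no HBase APIs; 
the class name, the split keys and the use of String ordering (HBase compares 
row keys as raw bytes) are all assumptions made for illustration. Each reducer 
index corresponds to one region, found by a binary search over the sorted 
region start keys.

```java
import java.util.Arrays;

// Illustrative stand-in for total-order partitioning by region boundaries.
// Hypothetical class; not part of Hadoop or HBase.
class RegionPartitionSketch {
    // Sorted start keys of regions 1..n-1; region 0 starts at the empty key.
    private final String[] splitPoints;

    RegionPartitionSketch(String... sortedSplitKeys) {
        this.splitPoints = sortedSplitKeys.clone();
    }

    // Returns the index of the region (and thus the reducer) owning rowKey.
    int getPartition(String rowKey) {
        int idx = Arrays.binarySearch(splitPoints, rowKey);
        // Exact match: the key is the first key of region idx + 1.
        // Otherwise binarySearch returns -(insertionPoint) - 1, and the
        // insertion point equals the number of split keys below rowKey.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    int numPartitions() {
        return splitPoints.length + 1;
    }

    public static void main(String[] args) {
        // Assumed table layout: four regions split at "g", "n" and "t".
        RegionPartitionSketch p = new RegionPartitionSketch("g", "n", "t");
        for (String key : new String[] {"apple", "horse", "zebra"}) {
            System.out.println(key + " -> reducer " + p.getPartition(key));
        }
    }
}
```

Since reducer i only ever writes rows for region i, a hint that places reducer 
i on the RegionServer hosting region i is what would make the output local.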

There are small gains (but gains nevertheless) if the data written by the 
reducer is local to the RegionServer hosting that specific partition (region) 
before we bulkload it in.

Likewise, if people have HBase jobs doing a reduce phase for whatever reason, 
and wish to achieve locality such that the reducer tasks (which emit the keys) 
are local to the regionserver serving the region for those keys, they can 
do so via a pre-configured job.
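
As a sketch of what such a pre-configured job could carry: the property name 
and value format below are hypothetical and are not taken from the attached 
patch, and java.util.Properties stands in for the job configuration. The idea 
is simply a per-partition list of preferred hosts that a scheduler could 
consult when placing each reducer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical config-based reduce locality hints; not the patch's format.
class ReduceLocalityHintSketch {
    // Assumed property name, for illustration only.
    static final String HINTS_KEY = "example.reduce.locality.hints";

    // Parses "0=hostA;1=hostB,hostC" into partition -> preferred hosts.
    static Map<Integer, List<String>> parseHints(Properties conf) {
        Map<Integer, List<String>> hints = new HashMap<>();
        String raw = conf.getProperty(HINTS_KEY, "").trim();
        if (raw.isEmpty()) {
            return hints; // No hints configured: no placement preference.
        }
        for (String entry : raw.split(";")) {
            String[] kv = entry.split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Malformed hint: " + entry);
            }
            int partition = Integer.parseInt(kv[0].trim());
            List<String> hosts = new ArrayList<>();
            for (String host : kv[1].split(",")) {
                if (!host.trim().isEmpty()) {
                    hosts.add(host.trim());
                }
            }
            hints.put(partition, hosts);
        }
        return hints;
    }

    // Hosts preferred for a partition; empty means "run anywhere".
    static List<String> preferredHosts(Map<Integer, List<String>> hints,
                                       int partition) {
        return hints.getOrDefault(partition, Collections.emptyList());
    }
}
```

A partition without an entry simply gets no preference, so a job that sets no 
hints would behave as it does today.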

There are some use-cases outside of HBase as well (I'll let those who've wanted 
this comment), but maybe YARN lets those be handled outside of MR today.

Or maybe, in the long term, HBase can get a custom AM to do its work in a more 
efficient manner than the current MR (MR is easy to use, though).

I just think using YARN to write a new app for everything is a slightly longer 
path to take if MR can be harmlessly tweaked a bit more to do the same thing 
along with the other good things it already does.

bq. My concern adding apis/config is that it becomes part of the user interface 
and I'd like to think through it's implications, and whether it's really 
necessary, before we commit to it. Makes sense?

Yes, makes sense on the API side. That's partly why I went with a simple 
config-based option for doing this.
                
> Locality hints for Reduce
> -------------------------
>
>                 Key: MAPREDUCE-199
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-199
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: applicationmaster, mrv2
>            Reporter: Benjamin Reed
>            Assignee: Harsh J
>         Attachments: MAPREDUCE-199.patch, MAPREDUCE-199.patch
>
>
> It would be nice if we could add a method to OutputFormat that would allow a 
> job to indicate where a reducer for a given partition should run. This 
> is similar to the getSplits() method on InputFormat. In our application the 
> reducer is using other data in addition to the map outputs during processing 
> and data accesses could be made more efficient if the JobTracker scheduled 
> the reducers to run on specific hosts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
