[jira] [Commented] (MAPREDUCE-4039) Sort Avoidance

Schubert Zhang (Commented) (JIRA) Tue, 20 Mar 2012 02:54:05 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233334#comment-13233334
 ]


Schubert Zhang commented on MAPREDUCE-4039:
-------------------------------------------

Nice summary Anty, and expect your patch on 0.23.

The more detailed description and benchmark on this feature, you can refer to 
http://www.slideshare.net/hanborq/hanborq-optimizations-on-hadoop-mapreduce-20120216a
 or 
http://www.slideshare.net/schubertzhang/hanborq-optimizations-on-hadoop-map-reduce-20120221a
 
                
> Sort Avoidance
> --------------
>
>                 Key: MAPREDUCE-4039
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4039
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: anty.rao
>            Priority: Minor
>             Fix For: 0.23.2
>
>
> Inspired by 
> [Tenzing|http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/37200.pdf],
>  in 5.1 MapReduce Enhanceemtns:
> {quote}*Sort Avoidance*. Certain operators such as hash join
> and hash aggregation require shuffling, but not sorting. The
> MapReduce API was enhanced to automatically turn off
> sorting for these operations. When sorting is turned off, the
> mapper feeds data to the reducer which directly passes the
> data to the Reduce() function bypassing the intermediate
> sorting step. This makes many SQL operators significantly
> more ecient.{quote}
> There are a lot of applications which need aggregation only, not 
> sorting.Using sorting to achieve aggregation is costly and inefficient. 
> Without sorting, up application can make use of hash table or hash map to do 
> aggregation efficiently.But application should bear in mind that reduce 
> memory is limited, itself is committed to manage memory of reduce, guard 
> against out of memory. Map-side combiner is not supported, you can also do 
> hash aggregation in map side  as a workaround.
> the following is the main points of sort avoidance implementation
> # add a configuration parameter ??mapreduce.sort.avoidance??, boolean type, 
> to turn on/off sort avoidance workflow.Two type of workflow are coexist 
> together.
> # key/value pairs emitted by map function is sorted by partition only, using 
> a more efficient sorting algorithm: counting sort.
> # map-side merge, use a kind of byte merge, which just concatenate bytes from 
> generated spills, read in bytes, write out bytes, without overhead of 
> key/value serialization/deserailization, comparison, which current version 
> incurs.
> # reduce can start up as soon as there is any map output available, in 
> contrast to sort workflow which must wait until all map outputs are fetched 
> and merged.
> # map output in memory can be directly consumed by reduce.When reduce can't 
> catch up with the speed of incoming map outputs, in-memory merge thread will 
> kick in, merging in-memory map outputs onto disk.
> # sequentially read in on-disk files to feed reduce, in contrast to currently 
> implementation which read multiple files concurrently, result in many disk 
> seek. Map output in memory take precedence over on disk files in feeding 
> reduce function.
> I have already implement this feature based on hadoop CDH3U3 and done some 
> performance evaluation, you can reference to 
> [https://github.com/hanborq/hadoop] for details. Now,I'm willing to port it 
> into yarn. Welcome for commenting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4039) Sort Avoidance

Reply via email to