[ https://issues.apache.org/jira/browse/MAPREDUCE-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445328#comment-13445328 ]
Mariappan Asokan commented on MAPREDUCE-4039: --------------------------------------------- Hi Anty, Sorry, I did not get back to you. Please take a look at the patch for MAPREDUCE-2454. I added a test that has a very simple implementation of NullSortPlugin. You can take a look at the code there. -- Asokan > Sort Avoidance > -------------- > > Key: MAPREDUCE-4039 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4039 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Components: mrv2 > Affects Versions: 0.23.2 > Reporter: anty.rao > Assignee: anty > Priority: Minor > Fix For: 0.23.2 > > Attachments: IndexedCountingSortable.java, > MAPREDUCE-4039-branch-0.23.2.patch, MAPREDUCE-4039-branch-0.23.2.patch, > MAPREDUCE-4039-branch-0.23.2.patch > > > Inspired by > [Tenzing|http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/37200.pdf], > in 5.1 MapReduce Enhanceemtns: > {quote}*Sort Avoidance*. Certain operators such as hash join > and hash aggregation require shuffling, but not sorting. The > MapReduce API was enhanced to automatically turn off > sorting for these operations. When sorting is turned off, the > mapper feeds data to the reducer which directly passes the > data to the Reduce() function bypassing the intermediate > sorting step. This makes many SQL operators significantly > more ecient.{quote} > There are a lot of applications which need aggregation only, not > sorting.Using sorting to achieve aggregation is costly and inefficient. > Without sorting, up application can make use of hash table or hash map to do > aggregation efficiently.But application should bear in mind that reduce > memory is limited, itself is committed to manage memory of reduce, guard > against out of memory. Map-side combiner is not supported, you can also do > hash aggregation in map side as a workaround. > the following is the main points of sort avoidance implementation > # add a configuration parameter ??mapreduce.sort.avoidance??, boolean type, > to turn on/off sort avoidance workflow.Two type of workflow are coexist > together. > # key/value pairs emitted by map function is sorted by partition only, using > a more efficient sorting algorithm: counting sort. > # map-side merge, use a kind of byte merge, which just concatenate bytes from > generated spills, read in bytes, write out bytes, without overhead of > key/value serialization/deserailization, comparison, which current version > incurs. > # reduce can start up as soon as there is any map output available, in > contrast to sort workflow which must wait until all map outputs are fetched > and merged. > # map output in memory can be directly consumed by reduce.When reduce can't > catch up with the speed of incoming map outputs, in-memory merge thread will > kick in, merging in-memory map outputs onto disk. > # sequentially read in on-disk files to feed reduce, in contrast to currently > implementation which read multiple files concurrently, result in many disk > seek. Map output in memory take precedence over on disk files in feeding > reduce function. > I have already implement this feature based on hadoop CDH3U3 and done some > performance evaluation, you can reference to > [https://github.com/hanborq/hadoop] for details. Now,I'm willing to port it > into yarn. Welcome for commenting. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira