[ https://issues.apache.org/jira/browse/MAPREDUCE-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556705#comment-13556705 ]
Mariappan Asokan commented on MAPREDUCE-4808:
---------------------------------------------

Hi Arun,

I will try to explain a simple use case for an external implementation of merge on the reduce side. Suppose this merge implementation allocates a fixed area of memory (a Java byte array) to store the shuffled data. This may be done to avoid frequent garbage collection by the JVM or to improve processor cache efficiency.

The methods in the {{Merge}} class accept input to the merge either as disk files (an array of {{Path}} objects) or as memory segments (a list of {{Segment}} objects). The former is not suitable because the merge is done in memory first, and any intermediate merged output file is under the control of the plugin implementation. The latter is not suitable because the memory for the shuffled data is not under the control of the plugin implementation.

Ideally, if an {{InputStream}} object is available, the external implementation can read shuffled data from the stream into the fixed area of memory at a specific offset in the byte array. With the {{MergeManagerPlugin}}, the external implementation gets the HTTP connection's {{InputStream}} object via the {{shuffle()}} method of the {{MapOutput}} object.

In addition, if the merge goes through multiple passes because the memory area is limited in size, there should be some way for the {{Shuffle}} to wait until memory is released by a merge pass. There is no method in {{Merge}} for that either.

I find that it is possible to define the interaction points between the current {{Shuffle}} and {{MergeManager}} using the {{MergeManagerPlugin}} interface. The plugin interface has only three methods, and it gives the external plugin a lot of freedom in its implementation. As a side effect, {{MapOutput}} is also refactored.

I hope I explained this well. If you have any questions, please let me know.
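
To make the fixed-memory use case a little more concrete, here is a minimal Java sketch of the idea. It is not the code in the patch and does not use the {{MergeManagerPlugin}} or {{MapOutput}} signatures; the class {{FixedShuffleArea}} and its methods {{reserve}}, {{readFully}} and {{releaseAll}} are hypothetical names chosen for illustration. It only shows the two things described above: reading from an {{InputStream}} into a preallocated byte array at a given offset, and making the shuffle side wait until a merge pass releases memory.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical fixed-memory shuffle area; an illustration only, not Hadoop code. */
final class FixedShuffleArea {
    private final byte[] area; // fixed region, allocated once up front
    private int used;          // bytes currently holding shuffled data

    FixedShuffleArea(int capacity) {
        this.area = new byte[capacity];
    }

    /** Blocks the shuffle side until len bytes are free, then returns the start offset. */
    synchronized int reserve(int len) throws InterruptedException {
        if (len < 0 || len > area.length) {
            throw new IllegalArgumentException("segment does not fit in the area");
        }
        while (used + len > area.length) {
            wait(); // woken by releaseAll() once a merge pass has finished
        }
        int offset = used;
        used += len;
        return offset;
    }

    /** Copies exactly len bytes from the stream into the area, starting at offset. */
    void readFully(InputStream in, int offset, int len) throws IOException {
        int done = 0;
        while (done < len) {
            int n = in.read(area, offset + done, len - done);
            if (n < 0) {
                throw new EOFException("stream ended after " + done + " bytes");
            }
            done += n;
        }
    }

    /** Called when a merge pass has consumed everything stored in the area. */
    synchronized void releaseAll() {
        used = 0;
        notifyAll();
    }

    synchronized int used() {
        return used;
    }

    byte byteAt(int index) {
        return area[index];
    }

    public static void main(String[] args) throws Exception {
        FixedShuffleArea shuffleArea = new FixedShuffleArea(8);
        // Stand-in for the HTTP connection's InputStream.
        InputStream mapOutput = new ByteArrayInputStream(new byte[] {1, 2, 3});
        int offset = shuffleArea.reserve(3);
        shuffleArea.readFully(mapOutput, offset, 3);
        System.out.println("bytes in use: " + shuffleArea.used());
        shuffleArea.releaseAll(); // a merge pass has drained the area
        System.out.println("bytes in use after merge pass: " + shuffleArea.used());
    }
}
```

In this sketch a fetcher thread would call {{reserve}} before reading each map output, and the merge thread would call {{releaseAll}} at the end of a pass. A real plugin would also need to track segment boundaries and spill merged output, which is left out here.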

-- Asokan

> Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations
> ----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4808
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4808
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>            Assignee: Mariappan Asokan
>         Attachments: COMBO-mapreduce-4809-4812-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, MergeManagerPlugin.pdf
>
>
> Now that Shuffle is pluggable (MAPREDUCE-4049), it would be convenient for alternate implementations to be able to reuse portions of the default implementation.
> This would come with the strong caveat that these classes are LimitedPrivate and Unstable.