[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556705#comment-13556705
 ] 

Mariappan Asokan commented on MAPREDUCE-4808:
---------------------------------------------

Hi Arun,
  I will try to explain a simple use case of an external implementation of 
merge on the reduce side.  Let us say this merge implementation has a fixed 
area of memory (a Java byte array) allocated to store the shuffled data.  This 
may be done to avoid frequent garbage collection by the JVM or for better 
processor cache efficiency.

The methods in the {{Merge}} class accept input to the merge either as disk 
files (an array of {{Path}} objects) or as memory segments (a list of 
{{Segment}} objects).  The former is not suitable since the merge is done in 
memory first and any intermediate merged output file is under the control of 
the plugin implementation.  The latter is not suitable because the memory for 
the shuffled data is not under the control of the plugin implementation.

Ideally, if an {{InputStream}} object is available, the external implementation 
can read shuffled data from the stream to the fixed area of memory at a 
specific offset in the byte array.
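
To make the idea concrete, here is a minimal sketch of reading one shuffled 
segment from a stream into a fixed byte array at a chosen offset.  The class 
{{FixedBufferShuffle}} and its methods are hypothetical names used only for 
illustration; they are not part of Hadoop or of the patch.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not a Hadoop class): packs segments into one fixed buffer. */
final class FixedBufferShuffle {
    private final byte[] buffer;   // fixed memory area, allocated once
    private int nextFree = 0;      // offset where the next segment will start

    FixedBufferShuffle(int capacity) {
        this.buffer = new byte[capacity];
    }

    /**
     * Reads exactly {@code length} bytes from {@code in} into the buffer,
     * starting at the next free offset, and returns that starting offset.
     */
    int readSegment(InputStream in, int length) throws IOException {
        if (length < 0 || length > buffer.length - nextFree) {
            throw new IllegalStateException("not enough room in the fixed buffer");
        }
        int start = nextFree;
        int done = 0;
        // InputStream.read may return fewer bytes than requested, so loop.
        while (done < length) {
            int n = in.read(buffer, start + done, length - done);
            if (n < 0) {
                throw new EOFException("stream ended after " + done + " bytes");
            }
            done += n;
        }
        nextFree += length;
        return start;
    }

    byte byteAt(int index) {
        return buffer[index];
    }
}
```

In this sketch the data is copied only once, directly from the stream into 
memory that the external implementation owns.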

With the {{MergeManagerPlugin}}, the external implementation will get the HTTP 
connection's {{InputStream}} object via the {{shuffle()}} method of the 
{{MapOutput}} object.  In addition, if the merge goes through multiple passes 
because the memory area is limited in size, there should be some way for the 
{{Shuffle}} to wait until memory is released by a merge pass.  There is no 
method in {{Merge}} for that either.
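
As an illustration of the waiting behaviour described above, the following is 
a rough sketch of a memory budget on which a fetcher thread blocks until a 
merge pass releases space.  {{MergeMemoryBudget}}, {{reserve()}} and 
{{release()}} are hypothetical names chosen for this example; they are not the 
methods of {{MergeManagerPlugin}} or of any Hadoop class.

```java
/** Hypothetical sketch (not a Hadoop class): blocks callers until memory is free. */
final class MergeMemoryBudget {
    private final long capacity;  // total size of the fixed memory area
    private long used = 0;        // bytes currently reserved

    MergeMemoryBudget(long capacity) {
        this.capacity = capacity;
    }

    /** Called by the shuffle side; waits until {@code bytes} can be reserved. */
    synchronized void reserve(long bytes) throws InterruptedException {
        if (bytes < 0 || bytes > capacity) {
            throw new IllegalArgumentException("request can never be satisfied");
        }
        while (used + bytes > capacity) {
            wait();  // woken by release() after a merge pass frees memory
        }
        used += bytes;
    }

    /** Called by the merge side when a pass has finished with {@code bytes}. */
    synchronized void release(long bytes) {
        if (bytes < 0 || bytes > used) {
            throw new IllegalStateException("releasing more than was reserved");
        }
        used -= bytes;
        notifyAll();  // let any waiting fetchers re-check the budget
    }

    synchronized long used() {
        return used;
    }
}
```

A fetcher would call {{reserve()}} before reading a map output into memory, 
and a merge pass would call {{release()}} once its inputs have been merged.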

I find that it is possible to define the interaction points between the 
current {{Shuffle}} and {{MergeManager}} using the {{MergeManagerPlugin}} 
interface.  The plugin interface has only three methods and it gives the 
external plugin a lot of freedom in its implementation.  As a side effect, 
{{MapOutput}} is also refactored.

Hope I explained this well.  If you have any questions, please let me know.

-- Asokan

                
> Refactor MapOutput and MergeManager to facilitate reuse by Shuffle 
> implementations
> ----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4808
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4808
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>            Assignee: Mariappan Asokan
>         Attachments: COMBO-mapreduce-4809-4812-4808.patch, 
> mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, 
> mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, 
> mapreduce-4808.patch, MergeManagerPlugin.pdf
>
>
> Now that Shuffle is pluggable (MAPREDUCE-4049), it would be convenient for 
> alternate implementations to be able to reuse portions of the default 
> implementation. 
> This would come with the strong caveat that these classes are LimitedPrivate 
> and Unstable.

