[jira] [Commented] (MAPREDUCE-4808) Allow reduce-side merge to be pluggable

Chris Douglas (JIRA) Sat, 12 Jan 2013 17:21:13 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552113#comment-13552113
 ]


Chris Douglas commented on MAPREDUCE-4808:
------------------------------------------

I like many of these changes anyway- the refactoring from tagged unions to a 
type hierarchy for {{MapOutput}} is nice- though I'm still not groking why the 
new API is necessary. The design doc reads like a summary of the patches and a 
motivation for pluggability generally, without directly motivating this 
particular moving part. [~masokan], is there code from MAPREDUCE-2454 reviewers 
should look at to better understand why this is required?

While much of the shuffle code can be reused between implementations- as Jerry 
points out- a common base for all implementations would be overly-ambitious. To 
rephrase what Arun has been saying: adding a "plugin" for tweaking the existing 
code is not necessary when the whole {{Shuffle}} can be swapped out after 
MAPREDUCE-4049.

The only exception I can think of is for {{LocalRunner}}, where the {{Merger}}- 
not the {{MergeManager}}- needs to be pluggable. The current patch introduces a 
dependency on the {{MergeManager}} for the local case, using only the subset 
that applies in that context. Without a use case (e.g., useful for testing), 
it's hard to justify the patch's abstraction if all implementations of 
{{MergeManager}} are exposed to this special invocation. To be fair, it may be 
required for the "ubertask" code to work seamlessly, but that case needs to be 
made. Even so: the {{Merger}} seems like the right cut point.

To be concrete, take the RDMA/network merge case as an example. Assume the 
overhead of a reader requires one to limit the number of clients/open 
connections in the reduce. Implementing a {{Merger}} that keeps a total order 
on its segments and only keeps the top _k_ remote segments open would loosely 
track that resource. But since the {{Fetcher}} is essentially just packaging 
the remote segment and translating the TCE into a {{MapOutput}} (i.e., doesn't 
open a connection), it's not obvious why a {{MergeManager}} would coordinate 
resource use between them. Even with a hybrid scheme- where a set of 
{{Fetcher}} and {{Merger}} instances pull segments locally based on some cost 
function, so they *do* share resources- aren't the changes extensive enough to 
justify a separate {{Shuffle}} implementation?

The Merger *does* handle multiple merge passes; intermediate merges of segments 
are part of its contract. While you may be referring to _continuous_ merging of 
segments arriving concurrently with a merge, again: why would it be preferable 
to configure a {{MergeManager}} instead of a wholly new {{Shuffle}} impl?
                
> Allow reduce-side merge to be pluggable
> ---------------------------------------
>
>                 Key: MAPREDUCE-4808
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4808
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.0.2-alpha
>            Reporter: Arun C Murthy
>            Assignee: Mariappan Asokan
>             Fix For: 2.0.3-alpha
>
>         Attachments: COMBO-mapreduce-4809-4812-4808.patch, 
> mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, 
> mapreduce-4808.patch, mapreduce-4808.patch, MergeManagerPlugin.pdf
>
>
> Allow reduce-side merge to be pluggable for MAPREDUCE-2454

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4808) Allow reduce-side merge to be pluggable

Reply via email to