[
https://issues.apache.org/jira/browse/MAPREDUCE-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552113#comment-13552113
]
Chris Douglas commented on MAPREDUCE-4808:
------------------------------------------
I like many of these changes anyway- the refactoring from tagged unions to a
type hierarchy for {{MapOutput}} is nice- though I'm still not groking why the
new API is necessary. The design doc reads like a summary of the patches and a
motivation for pluggability generally, without directly motivating this
particular moving part. [~masokan], is there code from MAPREDUCE-2454 reviewers
should look at to better understand why this is required?
While much of the shuffle code can be reused between implementations- as Jerry
points out- a common base for all implementations would be overly-ambitious. To
rephrase what Arun has been saying: adding a "plugin" for tweaking the existing
code is not necessary when the whole {{Shuffle}} can be swapped out after
MAPREDUCE-4049.
The only exception I can think of is for {{LocalRunner}}, where the {{Merger}}-
not the {{MergeManager}}- needs to be pluggable. The current patch introduces a
dependency on the {{MergeManager}} for the local case, using only the subset
that applies in that context. Without a use case (e.g., useful for testing),
it's hard to justify the patch's abstraction if all implementations of
{{MergeManager}} are exposed to this special invocation. To be fair, it may be
required for the "ubertask" code to work seamlessly, but that case needs to be
made. Even so: the {{Merger}} seems like the right cut point.
To be concrete, take the RDMA/network merge case as an example. Assume the
overhead of a reader requires one to limit the number of clients/open
connections in the reduce. Implementing a {{Merger}} that keeps a total order
on its segments and only keeps the top _k_ remote segments open would loosely
track that resource. But since the {{Fetcher}} is essentially just packaging
the remote segment and translating the TCE into a {{MapOutput}} (i.e., doesn't
open a connection), it's not obvious why a {{MergeManager}} would coordinate
resource use between them. Even with a hybrid scheme- where a set of
{{Fetcher}} and {{Merger}} instances pull segments locally based on some cost
function, so they *do* share resources- aren't the changes extensive enough to
justify a separate {{Shuffle}} implementation?
The Merger *does* handle multiple merge passes; intermediate merges of segments
are part of its contract. While you may be referring to _continuous_ merging of
segments arriving concurrently with a merge, again: why would it be preferable
to configure a {{MergeManager}} instead of a wholly new {{Shuffle}} impl?
> Allow reduce-side merge to be pluggable
> ---------------------------------------
>
> Key: MAPREDUCE-4808
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4808
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Affects Versions: 2.0.2-alpha
> Reporter: Arun C Murthy
> Assignee: Mariappan Asokan
> Fix For: 2.0.3-alpha
>
> Attachments: COMBO-mapreduce-4809-4812-4808.patch,
> mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch,
> mapreduce-4808.patch, mapreduce-4808.patch, MergeManagerPlugin.pdf
>
>
> Allow reduce-side merge to be pluggable for MAPREDUCE-2454
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira