[ https://issues.apache.org/jira/browse/MAPREDUCE-4808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552113#comment-13552113 ]
Chris Douglas commented on MAPREDUCE-4808: ------------------------------------------ I like many of these changes anyway- the refactoring from tagged unions to a type hierarchy for {{MapOutput}} is nice- though I'm still not groking why the new API is necessary. The design doc reads like a summary of the patches and a motivation for pluggability generally, without directly motivating this particular moving part. [~masokan], is there code from MAPREDUCE-2454 reviewers should look at to better understand why this is required? While much of the shuffle code can be reused between implementations- as Jerry points out- a common base for all implementations would be overly-ambitious. To rephrase what Arun has been saying: adding a "plugin" for tweaking the existing code is not necessary when the whole {{Shuffle}} can be swapped out after MAPREDUCE-4049. The only exception I can think of is for {{LocalRunner}}, where the {{Merger}}- not the {{MergeManager}}- needs to be pluggable. The current patch introduces a dependency on the {{MergeManager}} for the local case, using only the subset that applies in that context. Without a use case (e.g., useful for testing), it's hard to justify the patch's abstraction if all implementations of {{MergeManager}} are exposed to this special invocation. To be fair, it may be required for the "ubertask" code to work seamlessly, but that case needs to be made. Even so: the {{Merger}} seems like the right cut point. To be concrete, take the RDMA/network merge case as an example. Assume the overhead of a reader requires one to limit the number of clients/open connections in the reduce. Implementing a {{Merger}} that keeps a total order on its segments and only keeps the top _k_ remote segments open would loosely track that resource. But since the {{Fetcher}} is essentially just packaging the remote segment and translating the TCE into a {{MapOutput}} (i.e., doesn't open a connection), it's not obvious why a {{MergeManager}} would coordinate resource use between them. Even with a hybrid scheme- where a set of {{Fetcher}} and {{Merger}} instances pull segments locally based on some cost function, so they *do* share resources- aren't the changes extensive enough to justify a separate {{Shuffle}} implementation? The Merger *does* handle multiple merge passes; intermediate merges of segments are part of its contract. While you may be referring to _continuous_ merging of segments arriving concurrently with a merge, again: why would it be preferable to configure a {{MergeManager}} instead of a wholly new {{Shuffle}} impl? > Allow reduce-side merge to be pluggable > --------------------------------------- > > Key: MAPREDUCE-4808 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-4808 > Project: Hadoop Map/Reduce > Issue Type: New Feature > Affects Versions: 2.0.2-alpha > Reporter: Arun C Murthy > Assignee: Mariappan Asokan > Fix For: 2.0.3-alpha > > Attachments: COMBO-mapreduce-4809-4812-4808.patch, > mapreduce-4808.patch, mapreduce-4808.patch, mapreduce-4808.patch, > mapreduce-4808.patch, mapreduce-4808.patch, MergeManagerPlugin.pdf > > > Allow reduce-side merge to be pluggable for MAPREDUCE-2454 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira