[ https://issues.apache.org/jira/browse/HADOOP-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476514 ]
Hadoop QA commented on HADOOP-1043:
-----------------------------------
+1, because http://issues.apache.org/jira/secure/attachment/12352089/1043.patch
applied and successfully tested against trunk revision
http://svn.apache.org/repos/asf/lucene/hadoop/trunk/512499. Results are at
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch
> Optimize the shuffle phase (increase the parallelism)
> -----------------------------------------------------
>
> Key: HADOOP-1043
> URL: https://issues.apache.org/jira/browse/HADOOP-1043
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Reporter: Devaraj Das
> Assigned To: Devaraj Das
> Attachments: 1043.patch
>
>
> In the current shuffle code, a Reduce fetches at most one map output from
> any given node at a time. For example, if a particular node, say
> machine1.foo.com, ran 300 maps, the reducer would fetch just one output from
> there at a time. machine1.foo.com is inserted into a Set data structure
> (uniqueHosts), and until it is removed from there, no other map output is
> fetched from that machine. Fetching only one map output at a time from any
> particular host seems fine, but the logic for removing a node from
> uniqueHosts is such that there can be a long delay before a node is deleted
> from the Set (even after the map output has been fetched from that node).
> This probably leads to suboptimal performance, since it reduces the
> parallelism in fetching.
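
The behaviour described above can be illustrated with a minimal sketch. It is
not the code from 1043.patch or from Hadoop itself; the class name
UniqueHostsSketch and its methods are hypothetical, and only the uniqueHosts
name comes from the issue text. It assumes the one-fetch-per-host rule is kept
and that a host is released as soon as its fetch completes, so the next map
output from that host can be scheduled without delay.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not the actual Hadoop shuffle code. */
public class UniqueHostsSketch {
    // Hosts that currently have a fetch in flight (name taken from the issue text).
    private final Set<String> uniqueHosts = new HashSet<>();

    /** Marks the host busy and returns true only if no fetch was in flight for it. */
    public synchronized boolean tryAcquire(String host) {
        return uniqueHosts.add(host);
    }

    /** Frees the host right after its fetch finishes (assumed fix for the delay). */
    public synchronized void release(String host) {
        uniqueHosts.remove(host);
    }

    /**
     * Given the host of each pending map output, schedules at most one
     * fetch per idle host. Outputs on busy hosts are skipped and stay pending.
     */
    public synchronized List<String> schedule(List<String> pendingHosts) {
        List<String> scheduled = new ArrayList<>();
        for (String host : pendingHosts) {
            if (uniqueHosts.add(host)) {
                scheduled.add(host);
            }
        }
        return scheduled;
    }
}
```

In this sketch a fetcher thread would call release(host) in a finally block
around the copy, so a failed or finished fetch never leaves the host stuck in
the Set.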
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.