[ https://issues.apache.org/jira/browse/HADOOP-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das updated HADOOP-1043:
--------------------------------
Attachment: 1043.patch
This patch processes all the available CopyResult objects in the copyResults
list before querying the JobTracker for new map output locations.
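
The idea can be sketched roughly as follows. This is an illustration only, not
the code in 1043.patch: the class ShuffleSketch, the fields of CopyResult, and
the method drainCopyResults are hypothetical names. It assumes each finished
copy is recorded as a CopyResult, and that draining the whole list releases
every finished host from uniqueHosts before the JobTracker is asked for more
locations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ShuffleSketch {
    /** Hypothetical stand-in for the result of one map output copy. */
    static final class CopyResult {
        final String host;
        final int mapId;

        CopyResult(String host, int mapId) {
            this.host = host;
            this.mapId = mapId;
        }
    }

    // Hosts that currently have a fetch in flight (at most one per host).
    final Set<String> uniqueHosts = new HashSet<>();
    // Finished copies, appended by copier threads.
    final List<CopyResult> copyResults = new ArrayList<>();

    /** Drains every finished copy and frees its host; returns the number drained. */
    int drainCopyResults() {
        int drained = 0;
        synchronized (copyResults) {
            while (!copyResults.isEmpty()) {
                CopyResult r = copyResults.remove(0);
                uniqueHosts.remove(r.host); // host may be fetched from again
                drained++;
            }
        }
        return drained;
    }
}
```

Calling drainCopyResults() before each JobTracker query means a host is not
kept in uniqueHosts while its finished result sits unread in the list.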
> Optimize the shuffle phase (increase the parallelism)
> -----------------------------------------------------
>
> Key: HADOOP-1043
> URL: https://issues.apache.org/jira/browse/HADOOP-1043
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Reporter: Devaraj Das
> Assigned To: Devaraj Das
> Attachments: 1043.patch
>
>
> In the current shuffle code, a Reduce fetches at most one map output from
> any given host at a time. For example, if a particular node, say
> machine1.foo.com, ran 300 maps, the reducer would fetch just one of those
> outputs from it at a time. machine1.foo.com is inserted into a Set data
> structure (uniqueHosts), and until it is removed from there, no other map
> output is fetched from that machine. Fetching only one map output at a time
> from any particular host seems fine, but the logic for removing a node from
> uniqueHosts is such that there can be a long delay before the node is
> deleted from the Set, even after the map output has been fetched from that
> node. This probably leads to suboptimal performance, since it reduces the
> parallelism in fetching.
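
The one-fetch-per-host gate described above might look roughly like this. It
is a sketch under assumptions, not the actual shuffle code: HostGateSketch,
schedule, and fetchFinished are hypothetical names, and only the uniqueHosts
Set comes from the description.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HostGateSketch {
    // Hosts with a fetch currently in flight.
    private final Set<String> uniqueHosts = new HashSet<>();

    /** Picks fetches to start now: at most one per host that is not already busy. */
    List<String> schedule(List<String> pendingHosts) {
        List<String> scheduled = new ArrayList<>();
        for (String host : pendingHosts) {
            // Set.add returns false when the host is already present (busy).
            if (uniqueHosts.add(host)) {
                scheduled.add(host);
            }
        }
        return scheduled;
    }

    /** Marks the fetch from this host as complete so the host can be used again. */
    void fetchFinished(String host) {
        uniqueHosts.remove(host);
    }
}
```

In this sketch, the longer fetchFinished is delayed after a copy completes,
the longer the remaining outputs on that host wait, which is the loss of
parallelism the description refers to.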