[ https://issues.apache.org/jira/browse/HADOOP-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670118#action_12670118 ]
Matei Zaharia commented on HADOOP-1338:
---------------------------------------
Rather than fetching 1% of the map outputs, can we fetch a fixed number
(e.g. 10)? My concern is that with a large job, say 10,000 maps, fetching 1%
means sampling 100 map outputs, which will take a while.
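To make the difference concrete, here is a minimal sketch comparing the two
sampling policies. The class and method names are hypothetical and are not
taken from the attached patch; the only inputs are the 1% and "fixed 10"
figures mentioned above.

```java
// Hypothetical helper for illustration only; not code from hadoop-1338-v1.patch.
final class SampleSizeSketch {
    /** Percentage policy: sample 'percent' of the maps, rounded up, at least one. */
    static int percentSample(int numMaps, int percent) {
        if (numMaps <= 0) {
            return 0;
        }
        long rounded = ((long) numMaps * percent + 99) / 100; // integer ceiling
        return (int) Math.max(1, rounded);
    }

    /** Fixed policy: sample a constant number of maps, capped by the job size. */
    static int fixedSample(int numMaps, int fixedCount) {
        if (numMaps <= 0) {
            return 0;
        }
        return Math.min(numMaps, fixedCount);
    }

    public static void main(String[] args) {
        for (int maps : new int[] {50, 1000, 10000}) {
            System.out.println(maps + " maps: 1% -> " + percentSample(maps, 1)
                    + ", fixed -> " + fixedSample(maps, 10));
        }
    }
}
```

With the percentage policy the sample grows linearly with the job, while the
fixed policy keeps the sampling cost constant regardless of the number of maps.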
Another option to consider is having the JobTracker compute the average map
output size and pass it to the reducers through some other mechanism (e.g.
an RPC like getMapOutputLocations). The JT already has this information, so
each reducer could work without having to sample, which might be simpler. The
size could also be included with each map output location, in which case the
system would work even if maps have wildly different output sizes (I'm not
sure how often this happens).
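A rough sketch of the second idea, assuming each location entry returned to
the reducer carries its output size. The Location type and its fields are
hypothetical; they do not describe the real getMapOutputLocations return type.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration; not the actual Hadoop RPC classes.
final class MapOutputSizeSketch {
    /** A map output location annotated with the size of that output in bytes. */
    static final class Location {
        final String host;
        final int mapId;
        final long bytes;

        Location(String host, int mapId, long bytes) {
            this.host = host;
            this.mapId = mapId;
            this.bytes = bytes;
        }
    }

    /** Job-wide average size, the single number the JT could hand to reducers. */
    static long averageBytes(List<Location> locations) {
        if (locations.isEmpty()) {
            return 0;
        }
        long total = 0;
        for (Location loc : locations) {
            total += loc.bytes;
        }
        return total / locations.size();
    }

    public static void main(String[] args) {
        List<Location> locs = Arrays.asList(
                new Location("tt1", 0, 100L),
                new Location("tt1", 1, 300L),
                new Location("tt2", 2, 5000L));
        // The average hides the skew; the per-location sizes do not.
        System.out.println("average = " + averageBytes(locs));
        for (Location loc : locs) {
            System.out.println("map " + loc.mapId + " on " + loc.host + ": " + loc.bytes);
        }
    }
}
```

When sizes are skewed, as in the example data, the average alone is a poor
guide for any single output, which is the case the per-location size would
cover.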
> Improve the shuffle phase by using the "connection: keep-alive" and doing
> batch transfers of files
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1338
> URL: https://issues.apache.org/jira/browse/HADOOP-1338
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
> Attachments: hadoop-1338-v1.patch
>
>
> We should do transfers of map outputs at the granularity of
> *total-bytes-transferred* rather than the current way of transferring a
> single file and then closing the connection to the server. A single
> TaskTracker might have a couple of map output files for a given reduce, and
> we should transfer several of them (up to a certain total size) in a single
> connection to the TaskTracker. Using HTTP-1.1's keep-alive connection would
> help since it would keep the connection open for more than one file transfer.
> We should limit the transfers to a certain size so that we don't hold up a
> jetty thread indefinitely (and cause timeouts for other clients).
> Overall, this should give us improved performance.
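
The batching described in the issue could be sketched as a greedy grouping of
one TaskTracker's map output files under a byte cap, with each group fetched
over a single keep-alive connection. This is only an illustration under that
assumption: the names and the cap value are hypothetical, and the attached
patch may do this differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the logic from hadoop-1338-v1.patch.
final class ShuffleBatchSketch {
    /**
     * Groups file sizes (bytes) into batches whose totals stay within
     * maxBytesPerConnection. A file larger than the cap gets a batch of its own.
     */
    static List<List<Long>> batchBySize(List<Long> fileSizes, long maxBytesPerConnection) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current batch if adding this file would exceed the cap.
            if (!current.isEmpty() && currentBytes + size > maxBytesPerConnection) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(40L, 40L, 40L, 200L, 10L);
        System.out.println(batchBySize(sizes, 100L));
    }
}
```

Capping each batch bounds how long one connection occupies a server thread,
which is the concern the description raises about holding up jetty threads.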