[
https://issues.apache.org/jira/browse/HADOOP-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661839#action_12661839
]
Jothi Padmanabhan commented on HADOOP-1338:
-------------------------------------------
Yes, ideally the number of maps fetched should be determined implicitly by the
total size of the map outputs fetched rather than being a fixed number. However,
on the reducer side we do not know the sizes of the map outputs beforehand, and
the reducer needs to request specific map ids. It cannot just specify a single
size, because the TT (TaskTracker) will not know which maps a given reducer has
already fetched and which it has not. So we might need a compromise: the
reducer requests, say, 10 maps by id and also specifies the total size that it
is willing to accept. The TT then sends as many of the requested map outputs as
fit within that size. We can, of course, tune this approach later. Thoughts?
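
A minimal sketch of the TT-side selection described above, for discussion only.
The names here (ShuffleBatchSketch, selectWithinBudget, the outputSizes map) are
hypothetical and not existing Hadoop APIs. It also assumes that the first
available output is always sent even if it alone exceeds the budget, so that a
single large map output cannot stall the reducer; that rule is not part of the
proposal itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hadoop.
final class ShuffleBatchSketch {

    /**
     * Picks, in request order, the map outputs the TT would send back.
     *
     * @param requestedMapIds map ids the reducer asked for (e.g. 10 ids)
     * @param maxTotalBytes   total size the reducer is willing to accept
     * @param outputSizes     sizes of the map outputs held on this TT
     */
    static List<String> selectWithinBudget(List<String> requestedMapIds,
                                           long maxTotalBytes,
                                           Map<String, Long> outputSizes) {
        List<String> selected = new ArrayList<>();
        long totalBytes = 0;
        for (String mapId : requestedMapIds) {
            Long size = outputSizes.get(mapId);
            if (size == null) {
                // This TT does not hold the output; skip it.
                continue;
            }
            // Assumption: always send at least one output, even an oversized one.
            if (!selected.isEmpty() && totalBytes + size > maxTotalBytes) {
                break; // budget reached; the reducer re-requests the rest later
            }
            selected.add(mapId);
            totalBytes += size;
        }
        return selected;
    }
}
```

The reducer would then drop the returned ids from its pending set and ask for
the remaining ones in its next request, possibly over the same connection.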
> Improve the shuffle phase by using the "connection: keep-alive" and doing
> batch transfers of files
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-1338
> URL: https://issues.apache.org/jira/browse/HADOOP-1338
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Jothi Padmanabhan
>
> We should do transfers of map outputs at the granularity of
> *total-bytes-transferred* rather than the current way of transferring a
> single file and then closing the connection to the server. A single
> TaskTracker might have a couple of map output files for a given reduce, and
> we should transfer several of them (up to a certain total size) in a single
> connection to the TaskTracker. Using HTTP-1.1's keep-alive connection would
> help since it would keep the connection open for more than one file transfer.
> We should limit the transfers to a certain size so that we don't hold up a
> jetty thread indefinitely (and cause timeouts for other clients).
> Overall, this should give us improved performance.