[ https://issues.apache.org/jira/browse/HADOOP-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670322#action_12670322 ]

Devaraj Das commented on HADOOP-1338:
-------------------------------------

1) In TaskTracker.MapOutputServlet, 
   # The change that moves the call to getConf() up can be removed.
   # The "continue" statement in the for loop is redundant; replace it with a 
comment.
   # The code that closes the mapOutputFile in the finally block in the current 
codebase needs to remain.
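To illustrate the third point, here is a minimal, self-contained sketch of the close-in-finally pattern the comment asks to preserve (hypothetical class and method names, not the actual servlet code; the stream stand-in just records whether close() ran):

```java
import java.io.Closeable;
import java.io.IOException;

public class FinallyCloseSketch {
  // Stand-in for the servlet's map output stream; records whether close() ran.
  static class TrackingStream implements Closeable {
    boolean closed = false;
    public void close() { closed = true; }
  }

  // Mirrors the pattern to keep: close the map output file in a finally
  // block so the handle is released even when the transfer fails midway.
  static boolean serveMapOutput(TrackingStream mapOutputIn, boolean failMidTransfer) {
    try {
      if (failMidTransfer) {
        throw new IOException("simulated broken pipe");
      }
      return true;
    } catch (IOException e) {
      return false; // transfer failed, but cleanup below still runs
    } finally {
      mapOutputIn.close();
    }
  }
}
```

The point is that the close happens on both the success and the failure path, so a reducer disconnecting mid-copy cannot leak the file handle.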
2) In ReduceTask, 
   # The URL created for requesting map outputs can be compacted to avoid 
repeating the "attempt_<jobid>_m_" string. Instead, only the real attempt ID 
should be sent, and the tasktracker should recreate the full attempt ID 
string. That ensures we can fetch more map outputs in one request without 
hitting the HTTP_ENTITY_TOO_LARGE error.
   # There is no need to pass maxFetchSizePerHost as a request parameter.
   # Does it make sense to remove numInFlight and instead base the checks only 
on uniqueHosts.size()? Even in the current code, the updates to numInFlight 
and uniqueHosts go hand in hand.
I am still going through ReduceTask.java, so I might have more comments.
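A minimal sketch of the URL compaction suggested in the first point (hypothetical parameter and method names, not from the attached patch): the reducer sends the job ID once plus only the per-attempt suffixes, and the tasktracker side rebuilds the full attempt ID strings.

```java
import java.util.ArrayList;
import java.util.List;

public class CompactMapOutputUrl {
  // Reducer side: join the attempt suffixes (e.g. "000003_0") instead of
  // repeating the full "attempt_<jobid>_m_..." string for every map.
  static String buildQuery(String jobId, List<String> suffixes) {
    return "job=" + jobId + "&map=" + String.join(",", suffixes);
  }

  // Tasktracker side: expand each suffix back to a full attempt ID.
  static List<String> expand(String jobId, String mapParam) {
    List<String> ids = new ArrayList<>();
    for (String suffix : mapParam.split(",")) {
      ids.add("attempt_" + jobId + "_m_" + suffix);
    }
    return ids;
  }
}
```

Since the "attempt_<jobid>_m_" prefix is identical for every map in the batch, sending it once shrinks the request roughly in proportion to the number of maps requested, which is what keeps the URL under the entity-size limit.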

> Improve the shuffle phase by using the "connection: keep-alive" and doing 
> batch transfers of files
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1338
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1338
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Jothi Padmanabhan
>         Attachments: hadoop-1338-v1.patch
>
>
> We should do transfers of map outputs at the granularity of  
> *total-bytes-transferred* rather than the current way of transferring a 
> single file and then closing the connection to the server. A single 
> TaskTracker might have a couple of map output files for a given reduce, and 
> we should transfer multiple of them (up to a certain total size) in a single 
> connection to the TaskTracker. Using HTTP-1.1's keep-alive connection would 
> help since it would keep the connection open for more than one file transfer. 
> We should limit the transfers to a certain size so that we don't hold up a 
> jetty thread indefinitely (and cause timeouts for other clients).
> Overall, this should give us improved performance.
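The size-capped batching described above can be sketched as a pure sizing policy (hypothetical class and method names, not from the attached patch): given the sizes of the map outputs a tasktracker holds for this reduce, decide how many to pull over one keep-alive connection before the total-bytes cap is reached.

```java
import java.util.List;

public class BatchSizer {
  // Returns how many leading map outputs fit within maxTotalBytes on one
  // connection. At least one output is always taken so the fetch makes
  // progress even when a single file exceeds the cap.
  static int outputsPerConnection(List<Long> sizes, long maxTotalBytes) {
    long total = 0;
    int count = 0;
    for (long s : sizes) {
      if (count > 0 && total + s > maxTotalBytes) {
        break; // stop before exceeding the cap; remaining files wait
      }
      total += s;
      count++;
    }
    return count;
  }
}
```

Capping the batch this way bounds how long a single reducer holds a jetty thread, while the keep-alive connection still amortizes connection setup across every file in the batch.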

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
