[jira] [Commented] (MAPREDUCE-6197) Cache MapOutputLocations in ShuffleHandler

Junping Du (JIRA) Thu, 16 Jun 2016 06:41:59 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333791#comment-15333791
 ]


Junping Du commented on MAPREDUCE-6197:
---------------------------------------

Thanks [~jianhe] for the review and comments.
bq. one question is how/why do you choose such policy for determining the 
weight?
That's a good question. To control the size of a LoadingCache, we can use 
either maximumSize or maximumWeight. The reason for choosing maximumWeight 
over maximumSize is that each cache entry here has a variable size, which 
depends on {{key size + value size}}. With a fixed maximumSize, we still 
would not know how much memory the cache could end up using. The other reason 
is to stay consistent with what we have in HIVE-9912. If we find any issue 
with the current settings/policy in large production deployments in the 
future, we can change both sides together.
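
To illustrate the difference between bounding by entry count and bounding by 
weight, here is a minimal sketch that uses only the JDK. It is not the patch 
and not Guava's implementation: the class name WeightBoundedCache, the String 
key/value types, and the use of string length as the "size" are all 
assumptions made for the example. It only shows the idea of evicting 
least-recently-used entries until the summed {{key size + value size}} fits 
under a configured maximum.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical weight-bounded LRU cache; illustration only. */
class WeightBoundedCache {
    private final long maxWeight;
    private long currentWeight = 0;
    // accessOrder = true: iteration goes from least to most recently used.
    private final LinkedHashMap<String, String> entries =
        new LinkedHashMap<>(16, 0.75f, true);

    WeightBoundedCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    /** Assumed weight policy: key size + value size (string lengths). */
    static int weigh(String key, String value) {
        return key.length() + value.length();
    }

    void put(String key, String value) {
        String old = entries.put(key, value);
        if (old != null) {
            currentWeight -= weigh(key, old);
        }
        currentWeight += weigh(key, value);
        // Evict least-recently-used entries until the total weight fits.
        // An entry heavier than maxWeight ends up evicting itself.
        Iterator<Map.Entry<String, String>> it = entries.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<String, String> eldest = it.next();
            currentWeight -= weigh(eldest.getKey(), eldest.getValue());
            it.remove();
        }
    }

    String get(String key) {
        return entries.get(key); // also marks the entry as recently used
    }

    int size() {
        return entries.size();
    }

    long weight() {
        return currentWeight;
    }

    public static void main(String[] args) {
        WeightBoundedCache cache = new WeightBoundedCache(20);
        cache.put("a", "12345678");
        cache.put("b", "12345678");
        cache.put("c", "1234");
        System.out.println("entries=" + cache.size() + " weight=" + cache.weight());
    }
}
```

With a count-based bound, the number of retained entries is fixed but the 
memory they occupy is not; with a weight-based bound, the number of entries 
varies while the total stays under the limit.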

> Cache MapOutputLocations in ShuffleHandler
> ------------------------------------------
>
>                 Key: MAPREDUCE-6197
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6197
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Siddharth Seth
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-6197.patch
>
>
> ShuffleHandler currently seems to create a map of mapId - mapInfo (file.out / 
> index information) when it receives a message.
> This should be caching map info across requests, so that a scan of all 
> directories is not required for each reducer fetching from the same map.
> Also, the scan for each map output / index file is performed twice per mapId 
> within a request. In populateHeaders - once in the call to getMapOutputInfo, 
> and then directly in the method.
> For an invocation where we do end up with more than 1000 (default) mapIds in 
> a single call, and don't cache them in the map - the path constructed for 
> such entries will be invalid. This is highly unlikely to be the case though, 
> until there's proper caching.
> {code}
> MapOutputInfo info = mapOutputInfoMap.get(mapId);
> if (info == null) {
>   info = getMapOutputInfo(outputBasePathStr, mapId, reduceId, user);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
