[ https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3446:
----------------------------------

    Attachment: 3446-0.patch

I tested this on a 100-node cluster (98 tasktrackers) using sort. Given 
300MB of data per node, a sufficiently large io.sort.mb and fs.inmemory.size.mb, 
io.sort.spill.percent=1.0, fs.inmemory.merge.threshold=0, and 
mapred.inmem.usage=1.0, each reduce took an average of 121 seconds when reading 
from disk vs. 79 seconds when merging and reducing from memory. The sort with 
the patch finished the job in 8 minutes instead of 9, though both runs had slow 
tasktrackers that skewed the running times.
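
For reference, the settings quoted above could be applied programmatically 
through JobConf. This is only a sketch: the class name is made up, and the 
io.sort.mb / fs.inmemory.size.mb values are placeholders, since the comment 
only says they were "sufficiently large".

import org.apache.hadoop.mapred.JobConf;

public class InMemReduceBenchmarkConf {
  public static JobConf configure() {
    JobConf job = new JobConf();
    // Placeholder sizes; the test above only says these were "sufficiently large".
    job.set("io.sort.mb", "200");
    job.set("fs.inmemory.size.mb", "200");
    // Values quoted in the comment above.
    job.set("io.sort.spill.percent", "1.0");
    job.set("fs.inmemory.merge.threshold", "0");
    job.set("mapred.inmem.usage", "1.0");
    return job;
  }
}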

This also includes some similar changes to MapTask, letting the record and 
serialization buffer soft limits be configured separately.
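
The comment does not name the property that would control the record-buffer 
soft limit separately from the serialization-buffer one; only 
io.sort.spill.percent appears above. As a rough illustration of what 
"configured separately" could mean, with the record-side property name and 
both defaults assumed for this sketch:

import org.apache.hadoop.mapred.JobConf;

public class SoftLimitSketch {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    int sortMb = job.getInt("io.sort.mb", 100);
    // Serialization-buffer soft limit; this property is quoted above.
    float serSpillPer = job.getFloat("io.sort.spill.percent", 0.80f);
    // Record-buffer soft limit; this property name and default are assumptions.
    float recSpillPer = job.getFloat("io.sort.record.spill.percent", 0.80f);
    System.out.println("io.sort.mb=" + sortMb
        + " serialization soft limit=" + serSpillPer
        + " record soft limit=" + recSpillPer);
  }
}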

> The reduce task should not flush the in memory file system before starting 
> the reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3446
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3446
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>            Priority: Critical
>         Attachments: 3446-0.patch
>
>
> In the case where the entire reduce input fits in RAM, we currently force the 
> input to disk and re-read it before giving it to the reducer. It would be 
> much better if we merged from the ramfs and any spills to feed the reducer 
> its input.
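
The description asks for the reduce input to be merged from the in-memory 
file system and any on-disk spills directly into the reducer, rather than 
being flushed and re-read. As a rough, standalone illustration of that idea 
only (plain Java, not the Hadoop merge code), a k-way merge that feeds a 
consumer from both memory-resident and spilled runs might look like:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Conceptual sketch only: feed the consumer by k-way merging sorted
 * in-memory segments with sorted on-disk spill files, so memory-resident
 * data is never flushed to disk just to be re-read.
 */
public class MergeFeedSketch {

  /** A sorted run with a one-record lookahead, ordered by its head record. */
  static class Segment implements Comparable<Segment> {
    private final Iterator<String> records;
    String head;

    Segment(Iterator<String> records) {
      this.records = records;
      head = records.hasNext() ? records.next() : null;
    }

    void advance() {
      head = records.hasNext() ? records.next() : null;
    }

    @Override
    public int compareTo(Segment other) {
      return head.compareTo(other.head);
    }
  }

  /** Wrap a sorted spill file (one record per line) as an iterator (Java 8+). */
  static Iterator<String> spillFile(String path) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(path));
    return in.lines().iterator();   // reader is left open for the life of the sketch
  }

  /** Merge every run and hand records to the consumer in sorted order. */
  static void mergeAndFeed(List<Iterator<String>> inMemoryRuns,
                           List<Iterator<String>> spilledRuns) {
    PriorityQueue<Segment> heap = new PriorityQueue<>();
    for (Iterator<String> run : inMemoryRuns) addIfNonEmpty(heap, run);
    for (Iterator<String> run : spilledRuns) addIfNonEmpty(heap, run);

    while (!heap.isEmpty()) {
      Segment next = heap.poll();
      System.out.println(next.head);   // stand-in for the reduce() call
      next.advance();
      if (next.head != null) {
        heap.add(next);
      }
    }
  }

  private static void addIfNonEmpty(PriorityQueue<Segment> heap,
                                    Iterator<String> run) {
    Segment seg = new Segment(run);
    if (seg.head != null) {
      heap.add(seg);
    }
  }
}

The point of the sketch is only that the in-memory runs are consumed in 
place; nothing is written to disk before the reduce begins.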

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
