[
https://issues.apache.org/jira/browse/HADOOP-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Douglas updated HADOOP-3446:
----------------------------------
Attachment: 3446-0.patch
I tested this on a 100 node cluster (98 tasktrackers) using sort. Given
300MB/node of data, a sufficiently large io.sort.mb and fs.inmemory.size.mb,
io.sort.spill.percent=1.0, fs.inmemory.merge.threshold=0, and
mapred.inmem.usage=1.0, each reduce took an average of 121 seconds when reading
from disk vs. 79 seconds when merging and reducing from memory. While the sort
with the patch finished the job in 8 minutes instead of 9, both runs had slow
tasktrackers that skewed the overall running time.
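For reference, a minimal sketch of how the settings named above might be applied
to a job, assuming the property names and semantics from the attached patch; the
values just echo the test description and are not tuning recommendations.

  import org.apache.hadoop.mapred.JobConf;

  // Sketch only: applies the properties named in the comment above to a JobConf.
  // The values are placeholders matching the test description ("sufficiently
  // large" buffers, thresholds at 1.0/0), not recommendations.
  public class InMemoryReduceSettings {
    public static void apply(JobConf conf) {
      conf.setInt("io.sort.mb", 300);                 // "sufficiently large" for ~300MB/node
      conf.setInt("fs.inmemory.size.mb", 300);        // "sufficiently large" in-memory fs
      conf.setFloat("io.sort.spill.percent", 1.0f);   // spill only when the buffer is full
      conf.setInt("fs.inmemory.merge.threshold", 0);  // disable merges triggered by segment count
      conf.setFloat("mapred.inmem.usage", 1.0f);      // let the reduce keep its input in memory
    }
  }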
The patch also includes similar changes to MapTask, allowing the record and
serialization buffer soft limits to be configured separately.
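The comment names io.sort.spill.percent but not a separate key for the record
buffer, so the record-buffer key in the sketch below is invented purely to
illustrate the shape of the change.

  import org.apache.hadoop.mapred.JobConf;

  // Hypothetical illustration only: the record-buffer key below is made up,
  // since the actual property names added by the patch are not given here.
  public class MapSideSoftLimits {
    public static void apply(JobConf conf) {
      conf.setFloat("io.sort.spill.percent", 0.95f);        // serialization buffer soft limit (named above)
      conf.setFloat("io.sort.record.spill.percent", 0.95f); // hypothetical key for the record buffer
    }
  }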
> The reduce task should not flush the in memory file system before starting
> the reducer
> --------------------------------------------------------------------------------------
>
> Key: HADOOP-3446
> URL: https://issues.apache.org/jira/browse/HADOOP-3446
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Priority: Critical
> Attachments: 3446-0.patch
>
>
> In the case where the entire reduce input fits in RAM, we currently force the
> input to disk and re-read it before giving it to the reducer. It would be
> much better if we merged from the ramfs and any spills to feed the reducer
> its input.
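As a rough illustration of the description above (not the patch's actual code,
which goes through the ramfs and Hadoop's merge machinery), the reduce-side
change amounts to a k-way merge over sorted in-memory segments and any on-disk
spills, with the merged stream handed directly to the reduce:

  import java.util.Comparator;
  import java.util.Iterator;
  import java.util.List;
  import java.util.Map;
  import java.util.PriorityQueue;

  // Toy sketch of the idea in this issue: rather than flushing in-memory map
  // outputs to disk and re-reading them, merge the already-sorted in-memory
  // segments with any on-disk spills and feed the result to the reducer.
  public class MergeFeedSketch {

    // A "segment" here is any source of sorted (key, value) pairs: an
    // in-memory buffer or a reader over a spill file.
    static Iterator<Map.Entry<String, String>> merge(
        List<Iterator<Map.Entry<String, String>>> segments) {
      final PriorityQueue<Segment> heap = new PriorityQueue<Segment>(
          Comparator.comparing((Segment s) -> s.head.getKey()));
      for (Iterator<Map.Entry<String, String>> it : segments) {
        if (it.hasNext()) {
          heap.add(new Segment(it));
        }
      }
      return new Iterator<Map.Entry<String, String>>() {
        public boolean hasNext() { return !heap.isEmpty(); }
        public Map.Entry<String, String> next() {
          Segment s = heap.poll();
          Map.Entry<String, String> record = s.head;
          if (s.advance()) {
            heap.add(s);   // segment still has records; keep merging from it
          }
          return record;
        }
      };
    }

    // Tracks the smallest unconsumed record of one segment for the heap.
    static final class Segment {
      final Iterator<Map.Entry<String, String>> it;
      Map.Entry<String, String> head;
      Segment(Iterator<Map.Entry<String, String>> it) { this.it = it; this.head = it.next(); }
      boolean advance() {
        if (it.hasNext()) { head = it.next(); return true; }
        return false;
      }
    }
  }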
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.