[ https://issues.apache.org/jira/browse/HADOOP-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das reassigned HADOOP-910: ---------------------------------- Assignee: Amar Kamat (was: Gautam Kowshik) > Reduces can do merges for the on-disk map output files in parallel with their > copying > ------------------------------------------------------------------------------------- > > Key: HADOOP-910 > URL: https://issues.apache.org/jira/browse/HADOOP-910 > Project: Hadoop > Issue Type: Improvement > Components: mapred > Reporter: Devaraj Das > Assignee: Amar Kamat > > Proposal to extend the parallel in-memory-merge/copying, that is being done > as part of HADOOP-830, to the on-disk files. > Today, the Reduces dump the map output files to disk and the final merge > happens only after all the map outputs have been collected. It might make > sense to parallelize this part. That is, whenever a Reduce has collected > io.sort.factor number of segments on disk, it initiates a merge of those and > creates one big segment. If the rate of copying is faster than the merge, we > can probably have multiple threads doing parallel merges of independent sets > of io.sort.factor number of segments. If the rate of copying is not as fast > as merge, we stand to gain a lot - at the end of copying of all the map > outputs, we will be left with a small number of segments for the final merge > (which hopefully will feed the reduce directly (via the RawKeyValueIterator) > without having to hit the disk for writing additional output segments). > If the disk bandwidth is higher than the network bandwidth, we have a good > story, I guess, to do such a thing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.