[ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467705 ]
Doug Judd commented on HADOOP-939:
----------------------------------
Bryan writes:
Seems like it wouldn't be more expensive than a few calls to the appropriate
Comparator to figure this out - the OutputCollector merely compares each
output key to the previously output key. If order is preserved, output this
extra truth when the "spill" to disk happens as a header field. If not, you
can stop calling the comparator as soon as output fails to be ordered a
single time. In any case, this means that sorts can be skipped on any output
sequences that are already sorted, and only applied to output sequences that
aren't.
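To make Bryan's suggestion concrete, here is a minimal sketch in Python (not Hadoop's actual OutputCollector API). The class name SortTrackingCollector, the spill() method, and the "sorted" header field are all hypothetical names chosen for illustration. It compares each key to the previous one and stops calling the comparator after the first out-of-order key.

```python
from typing import Any, Callable, Dict, List, Tuple


class SortTrackingCollector:
    """Hypothetical collector that tracks whether keys arrive in sorted order."""

    def __init__(self, compare: Callable[[Any, Any], int]) -> None:
        self.compare = compare          # returns <0, 0, >0 like a Comparator
        self.records: List[Tuple[Any, Any]] = []
        self.in_order = True            # stays True until one key breaks the order
        self.comparisons = 0            # counted only to show the early stop

    def collect(self, key: Any, value: Any) -> None:
        # Compare against the previous key only while the order still holds.
        if self.in_order and self.records:
            self.comparisons += 1
            if self.compare(self.records[-1][0], key) > 0:
                self.in_order = False   # no further comparator calls after this
        self.records.append((key, value))

    def spill(self) -> Tuple[Dict[str, bool], List[Tuple[Any, Any]]]:
        # Assumed header layout: a single flag recording sortedness.
        header = {"sorted": self.in_order}
        return header, list(self.records)


def cmp(a: Any, b: Any) -> int:
    return (a > b) - (a < b)
```

For output that is already sorted, the extra cost is one comparison per record after the first. For unsorted output, the cost ends at the first key that breaks the order.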
My (doug) thought was to name the intermediate output file with a .sorted
extension if it comes out sorted.
As far as Owen's comment goes, the reducer should merge the intermediate files
with the .sorted extension in parallel. The non-sorted ones can be pulled and
sorted in any order.
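The reduce-side idea can be sketched roughly as follows, in Python rather than Hadoop's own code. The function merge_intermediate and the in-memory dict standing in for intermediate files are assumptions for illustration. Only the .sorted naming convention comes from the comment above. Runs named *.sorted go straight into a k-way merge, and the others are sorted first.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

Record = Tuple[str, str]  # (key, value)


def merge_intermediate(files: Dict[str, List[Record]]) -> Iterator[Record]:
    """Merge map outputs by key, sorting only the runs not marked .sorted."""
    runs = []
    for name, records in files.items():
        if name.endswith(".sorted"):
            runs.append(records)                             # already ordered, no sort
        else:
            runs.append(sorted(records, key=lambda r: r[0])) # sort this run first
    # heapq.merge lazily performs a k-way merge over the ordered runs.
    return heapq.merge(*runs, key=lambda r: r[0])


if __name__ == "__main__":
    files = {
        "map0.sorted": [("a", "1"), ("c", "3")],
        "map1": [("d", "4"), ("b", "2")],
    }
    for key, value in merge_intermediate(files):
        print(key, value)
```

The sketch trusts the file name. A run wrongly marked .sorted would yield an incorrectly ordered merge, so the flag must only be set when the map side has verified the order.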
> No-sort optimization
> --------------------
>
> Key: HADOOP-939
> URL: https://issues.apache.org/jira/browse/HADOOP-939
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Environment: all
> Reporter: Doug Judd
>
> There should be a way to tell the mapred framework that the output of the
> map() phase will already be sorted. The Reduce phase can just merge the
> intermediate files together without sorting.