Doug's calculation shows that the total gain can be only 1/3: of the 27s/100MB, 15 are unavoidable, and taking advantage of largely pre-sorted input reduces the sort/shuffle overhead from 12 to about 3, so the best case is 27 -> 18.
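Spelling the arithmetic out (my reading of the per-100MB figures quoted below; the ~3s residual overhead is an estimate):

    unavoidable (a + e + f + g):  1 + 1 + 2 + 11 = 15
    sort/shuffle (b + c + d):     1 + 10 + 1     = 12   -> total 27
    presorted best case:          overhead ~3           -> total 18
    maximum gain:                 (27 - 18) / 27 = 1/3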

Does this model assume that the size of the output of reduce is similar to the size of the input?

An important class of applications (mentioned in this thread before) uses two inputs:

  -- M ("master file"): very large, presorted, and not changing from run to run
  -- D ("details file"): smaller, different from run to run, not necessarily presorted

and the output size is proportional to the size of D.
In this case the gain from "no-sort" may be much higher, as the 13s of "transfer and write" to DFS (f and g) are applied to the smaller amount of data, while the 12s of sort-and-shuffle-related work (b-d) are saved on the larger data.
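To make the master/details pattern concrete, here is a minimal plain-Java sketch (illustrative only, not Hadoop code; it assumes one unique key per line): only the small D side is sorted in memory, while the huge presorted M is streamed once and never re-sorted, so the work proportional to M is a single sequential read.

    import java.io.*;
    import java.util.*;

    public class MasterDetailsJoin {
        // Streams presorted master lines past a small, in-memory sorted list
        // of detail keys, emitting master lines whose key appears in details.
        public static void join(BufferedReader master, List<String> details,
                                Writer out) throws IOException {
            Collections.sort(details);              // cheap: |D| << |M|
            Iterator<String> it = details.iterator();
            String want = it.hasNext() ? it.next() : null;
            String m;
            while (want != null && (m = master.readLine()) != null) {
                int cmp = m.compareTo(want);
                while (cmp > 0) {                   // detail key absent; skip it
                    if (!it.hasNext()) { out.flush(); return; }
                    want = it.next();
                    cmp = m.compareTo(want);
                }
                if (cmp == 0) {                     // match: emit, advance details
                    out.write(m);
                    out.write('\n');
                    want = it.hasNext() ? it.next() : null;
                }                                   // cmp < 0: advance master only
            }
            out.flush();
        }
    }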


On Jan 25, 2007, at 5:21 PM, Doug Cutting (JIRA) wrote:


[ https://issues.apache.org/jira/browse/HADOOP-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467717 ]

Doug Cutting commented on HADOOP-939:
-------------------------------------

I suspect that most of the performance gains to be had by declaring input to be sorted can also be had by using heuristics that also speed things up when input is only nearly sorted. (By "nearly sorted" I mean things like merging a set of updates into a sorted database, e.g., the crawl db update task in Nutch.)
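One such heuristic, sketched below (my illustration, not existing Hadoop code): rather than sorting the whole map-output buffer, find the ascending runs it already contains and merge those. Fully sorted input is a single run and costs nothing extra; nearly sorted input yields only a few runs.

    import java.util.ArrayList;
    import java.util.List;

    public class NaturalRuns {
        // Returns the start index of each maximal non-decreasing run in
        // 'keys'. The runs can then be k-way merged instead of sorting
        // the buffer from scratch.
        static List<Integer> runStarts(String[] keys) {
            List<Integer> starts = new ArrayList<>();
            if (keys.length == 0) return starts;
            starts.add(0);
            for (int i = 1; i < keys.length; i++)
                if (keys[i].compareTo(keys[i - 1]) < 0)
                    starts.add(i);                  // order breaks: new run
            return starts;
        }
    }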

Eric Baldeschwieler proposed a simple model for MapReduce performance. If you assume that disks can read and write at 100MB/s, and that nodes can talk within rack at 100MB/s (Gb/s) and to nodes in another rack at 10MB/s, then a MapReduce requires the following number of seconds per 100MB. (Note that this assumes various sort optimizations that are already in progress, where map outputs are buffered and sorted before they're spilled to the local disk on map nodes, and reduce inputs are buffered and merged before they're spilled to the local disk on the reduce node, so that, in many cases, reduce can proceed without an explicit sort stage but simply by merging a set of already sorted input files from the local disk.)

a.  1 read input data from local drive on map node
[ map ]
b.  1 write batches of sorted output data to temporary file on map node
c. 10 shuffle batches of sorted data to reduce node
d.  1 write batches of sorted data to reduce node
[ reduce ]
e.  1 write one copy of output locally
f.  2 transfer and write one copy to another node on the same rack
g. 11 transfer and write one copy to an off-rack node

So the total is 27s/100MB. Only two of those are really sort-specific, (b) and (d). 15 (a, e, f, and g; more than half) are unavoidable.

The biggest chunk of fat to go after for pre-sorted input is (c). This can be eliminated if maps can be placed near reduces. For example, tasktrackers might report the size of each partition they're generating and the jobtracker might use this to schedule reduces on racks which already have a lot of their input.
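A toy version of that scheduling idea (hypothetical; nothing like this exists in the jobtracker today): given the reported bytes of each partition per rack, greedily place each reduce on the rack that already holds most of its input, turning the 10s off-rack shuffle into a ~1s in-rack one.

    public class ReducePlacement {
        // bytesOnRack[p][r] = bytes of reduce partition p already on rack r,
        // as reported by the tasktrackers generating map output.
        static int[] assignRacks(long[][] bytesOnRack) {
            int[] rackFor = new int[bytesOnRack.length];
            for (int p = 0; p < bytesOnRack.length; p++) {
                int best = 0;
                for (int r = 1; r < bytesOnRack[p].length; r++)
                    if (bytesOnRack[p][r] > bytesOnRack[p][best])
                        best = r;
                rackFor[p] = best;              // run reduce p nearest its data
            }
            return rackFor;
        }
    }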


No-sort optimization
--------------------

                Key: HADOOP-939
                URL: https://issues.apache.org/jira/browse/HADOOP-939
            Project: Hadoop
         Issue Type: New Feature
         Components: mapred
        Environment: all
           Reporter: Doug Judd

There should be a way to tell the mapred framework that the output of the map() phase will already be sorted. The Reduce phase can just merge the intermediate files together without sorting.
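For illustration, the kind of merge the reduce phase would do (a minimal sketch in plain Java, assuming line-per-record text runs; not the actual mapred merge code): a priority queue k-way merges the already-sorted intermediate files, so reduce sees keys in order without a separate sort pass.

    import java.io.*;
    import java.util.*;

    public class NoSortMerge {
        public static void merge(List<BufferedReader> runs, Writer out)
                throws IOException {
            PriorityQueue<Map.Entry<String, BufferedReader>> heap =
                new PriorityQueue<>(Map.Entry.comparingByKey());
            for (BufferedReader r : runs) {     // prime with one line per run
                String line = r.readLine();
                if (line != null) heap.add(new AbstractMap.SimpleEntry<>(line, r));
            }
            while (!heap.isEmpty()) {
                Map.Entry<String, BufferedReader> e = heap.poll();
                out.write(e.getKey());
                out.write('\n');
                String next = e.getValue().readLine();  // refill from same run
                if (next != null)
                    heap.add(new AbstractMap.SimpleEntry<>(next, e.getValue()));
            }
            out.flush();
        }
    }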

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

