[jira] Commented: (HADOOP-946) Map directly to HDFS or reduce()

Doug Cutting (JIRA) Mon, 29 Jan 2007 12:31:11 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468399 ]


Doug Cutting commented on HADOOP-946:
-------------------------------------

> Seems like you'd need an atomic record append API to handle this.

Using the numbers from HADOOP-939, one could eliminate steps (c) and (d), 
saving 11 out of 27 seconds, by locating reduces near maps.  If you had atomic 
append then you could save an additional second or two by not buffering things 
locally.  So there'd be some improvement by eliminating reduce altogether, but 
not huge.  The reduce would effectively take the place of atomic append.
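
To make the arithmetic above concrete, here is a rough sketch in Java. It uses 
only the figures quoted in this comment (27 seconds total, 11 seconds for steps 
(c) and (d), and "a second or two" for atomic append). The class and method 
names are hypothetical and purely illustrative, not part of Hadoop.

```java
public class SpeedupEstimate {
    // Seconds left after removing `saved` seconds from a `total`-second run.
    static double remaining(double total, double saved) {
        return total - saved;
    }

    // Fraction of the original run time that was eliminated.
    static double fractionSaved(double total, double saved) {
        return saved / total;
    }

    public static void main(String[] args) {
        double total = 27.0;       // total run time quoted from HADOOP-939
        double colocation = 11.0;  // steps (c) and (d), saved by locating reduces near maps
        double appendLow = 1.0;    // "a second or two" from atomic append: lower bound
        double appendHigh = 2.0;   // upper bound

        System.out.printf("reduces near maps: %.0f s left, %.0f%% saved%n",
                remaining(total, colocation),
                100 * fractionSaved(total, colocation));

        // Adding atomic append on top of co-location, as a range.
        System.out.printf("plus atomic append: %.0f to %.0f s left, %.0f%% to %.0f%% saved%n",
                remaining(total, colocation + appendHigh),
                remaining(total, colocation + appendLow),
                100 * fractionSaved(total, colocation + appendLow),
                100 * fractionSaved(total, colocation + appendHigh));
    }
}
```

Under these figures, co-location alone removes about 41% of the run time, and 
atomic append would raise that to only about 44% to 48%.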

> Unless I didn't follow your logic correctly on HADOOP-939, it seems like this 
> optimization is orthogonal.

My argument is that locating reduces near maps will substantially help both in 
this case (no reduce needed) and in HADOOP-939 (no sort needed), and that it 
will also help applications that lean only lightly on sort and reduce.  That 
optimization gets the vast majority of the theoretical speedup with no public 
API changes and no special control flow, and it works for more applications.  
So I'd give that higher priority.

> Map directly to HDFS or reduce()
> --------------------------------
>
>                 Key: HADOOP-946
>                 URL: https://issues.apache.org/jira/browse/HADOOP-946
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> For situations where you know that the output of the Map phase is already 
> aggregated (e.g. the input is the output of another Map-reduce job and map() 
> preserves the aggregation), then there should be a way to tell the framework 
> that this is the case so that it can pipe the map() output directly to the 
> reduce() function, or HDFS in the case of IdentityReducer.  This will 
> probably require forcing the number of map tasks to equal the number of 
> reduce tasks.  This will save the disk I/O required to generate intermediate 
> files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
