[jira] Commented: (HADOOP-946) Map directly to HDFS or reduce()

Doug Judd (JIRA) Mon, 29 Jan 2007 12:00:12 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468389
 ]


Doug Judd commented on HADOOP-946:
----------------------------------

> Actually, the ability to keep a large number of maps (one per block) and a 
> much smaller number of outputs would be the primary reason I can see for 
> adding this feature [...]

Seems like you'd need an atomic record append API to handle this.

> If we place reduces to nodes or racks where their input dominates (as 
> discussed in HADOOP-939 comments), then this could be implemented by simply 
> specifying a partition method that returns the hash of the map node name.

I guess the primary reason for this enhancement would be to avoid writing and 
subsequently reading intermediate files thereby reducing disk load on the 
system as a whole.  The map-to-hdfs workaround sounds reasonable.  If you want 
to just run a reduce without generating intermediate files, then the reduce 
task needs to be able to pull from HDFS.  Unless I didn't follow your logic 
correctly on HADOOP-939, it seems like this optimization is orthogonal.


> Map directly to HDFS or reduce()
> --------------------------------
>
>                 Key: HADOOP-946
>                 URL: https://issues.apache.org/jira/browse/HADOOP-946
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> For situations where you know that the output of the Map phase is already 
> aggregated (e.g. the input is the output of another Map-reduce job and map() 
> preserves the aggregation), then there should be a way to tell the framework 
> that this is the case so that it can pipe the map() output directly to the 
> reduce() function, or HDFS in the case of IdentityReducer.  This will 
> probably require forcing the number of map tasks to equal the number of 
> reduce tasks.  This will save the disk I/O required to generate intermediate 
> files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-946) Map directly to HDFS or reduce()

Reply via email to