[ https://issues.apache.org/jira/browse/HADOOP-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468375 ]

Doug Cutting commented on HADOOP-946:
-------------------------------------

Note that there is a standard workaround for this: have your map function write 
output to DFS as a side-effect.  One could write a MapRunnable that automates 
this, using the job's specified OutputFormat to create RecordWriters, etc.
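
Roughly, such a MapRunnable might look like the untested sketch below.  The 
pre-generics signatures follow the mapred interfaces as of this writing, and 
the use of the "mapred.task.id" property to name the per-task output file is 
an assumption; both may need adjusting to trunk.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.ReflectionUtils;

public class DirectOutputMapRunner implements MapRunnable {
  private JobConf job;
  private Mapper mapper;

  public void configure(JobConf job) {
    this.job = job;
    this.mapper = (Mapper)ReflectionUtils.newInstance(job.getMapperClass(), job);
  }

  public void run(RecordReader input, OutputCollector unused, Reporter reporter)
    throws IOException {
    // Create a RecordWriter from the job's OutputFormat, writing one
    // output file per map task; the file name here is an assumption.
    String name = "part-" + job.get("mapred.task.id");
    final RecordWriter writer = job.getOutputFormat()
      .getRecordWriter(FileSystem.get(job), job, name, reporter);
    // Adapt the RecordWriter to the OutputCollector the Mapper expects,
    // so map output goes straight to DFS, bypassing sort and shuffle.
    OutputCollector direct = new OutputCollector() {
      public void collect(WritableComparable key, Writable value)
        throws IOException {
        writer.write(key, value);
      }
    };
    try {
      WritableComparable key = (WritableComparable)
        ReflectionUtils.newInstance(job.getInputKeyClass(), job);
      Writable value = (Writable)
        ReflectionUtils.newInstance(job.getInputValueClass(), job);
      while (input.next(key, value)) {
        mapper.map(key, value, direct, reporter);
      }
    } finally {
      mapper.close();
      writer.close(reporter);
    }
  }
}

A job using this would also set its number of reduces to zero (if the 
framework permits that) so no shuffle is scheduled at all.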

> This will probably require forcing the number of map tasks to equal the 
> number of reduce tasks.

Actually, the ability to keep a large number of maps (one per block) and a much 
smaller number of outputs would be the primary reason I can see for adding this 
feature to the framework, as opposed to using something like the workaround 
mentioned above.  If we place reduces on nodes or racks where their input 
dominates (as discussed in HADOOP-939 comments), then this could be implemented 
by simply specifying a partition method that returns the hash of the map node 
name.
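
To illustrate, a hypothetical partitioner of that sort might look as follows, 
using the local host name as a stand-in for the map node name (again an 
untested sketch against the pre-generics mapred interfaces):

import java.net.InetAddress;
import java.net.UnknownHostException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class NodeHashPartitioner implements Partitioner {
  private int nodeHash;

  public void configure(JobConf job) {
    try {
      // Hash the name of the node this map task runs on, so every
      // record emitted from this node lands in the same partition.
      nodeHash = InetAddress.getLocalHost().getHostName().hashCode();
    } catch (UnknownHostException e) {
      nodeHash = 0;  // degrade to a single partition
    }
  }

  public int getPartition(WritableComparable key, Writable value,
                          int numPartitions) {
    return (nodeHash & Integer.MAX_VALUE) % numPartitions;
  }
}

With reduces placed as discussed in HADOOP-939, each reduce's input would 
then be dominated by maps from a single node or rack.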

> Map directly to HDFS or reduce()
> --------------------------------
>
>                 Key: HADOOP-946
>                 URL: https://issues.apache.org/jira/browse/HADOOP-946
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Doug Judd
>
> For situations where you know that the output of the Map phase is already 
> aggregated (e.g. the input is the output of another Map-reduce job and map() 
> preserves the aggregation), there should be a way to tell the framework 
> that this is the case so that it can pipe the map() output directly to the 
> reduce() function, or HDFS in the case of IdentityReducer.  This will 
> probably require forcing the number of map tasks to equal the number of 
> reduce tasks.  This will save the disk I/O required to generate intermediate 
> files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
