[ https://issues.apache.org/jira/browse/HADOOP-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675453#action_12675453 ]

Milind Bhandarkar commented on HADOOP-5299:
-------------------------------------------

@Arun

>   1.  Typical mid-to-large applications have reducers which process multiple 
> gigabytes of data (e.g. our friend, the sort<nodes>, has each reducer 
> generating 5GB) which means we'll need at least tens of intermediate merged 
> files per reducer (assuming a heap-size of 512M).

Would it be possible to keep a single file open for writing, and flush the 
buffer to it only on spill? That way, the total number of input files would 
still be one per reducer.
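
To make the idea concrete, here is a minimal sketch. The SpillFile helper is 
hypothetical, not an existing Hadoop class; the point is one open output 
stream per reducer, with each in-memory spill appended and its start offset 
recorded:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: one spill file per reducer, kept open for the
// task's lifetime. Each spill is appended, so the reducer still has
// exactly one input file.
class SpillFile {
  private final FSDataOutputStream out;
  private final List<Long> spillOffsets = new ArrayList<Long>();

  SpillFile(FileSystem fs, Path path) throws IOException {
    this.out = fs.create(path);
  }

  /** Append one in-memory spill; remember where it starts. */
  void spill(byte[] buffer, int length) throws IOException {
    spillOffsets.add(out.getPos());
    out.write(buffer, 0, length);
    out.flush();
  }

  /** Start offsets of each spill, for the later merge. */
  List<Long> offsets() { return spillOffsets; }

  void close() throws IOException { out.close(); }
}
{code}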
 
>   2. io.sort.factor defaults to 100, which means we might open up to 100 
> files for the final merge which then feeds the 'reduce' call. Given that we 
> might have every reducer in the cluster open 100 files at a time, for a 
> mid-size cluster of 500 nodes with 4 reduce slots that would be 2000 * 100 ...

If the trick above works, then this could be only one open file, with multiple 
positional reads (preads), no?
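
Something along these lines, assuming the single-spill-file layout sketched 
above. FSDataInputStream's positional readFully does not move the stream's 
current offset, so one open stream can serve every merge segment:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: the final merge keeps a single FSDataInputStream open and
// serves each merge segment with a pread, instead of holding
// io.sort.factor streams open at once.
class SegmentReader {
  private final FSDataInputStream in;   // the single open input file

  SegmentReader(FileSystem fs, Path spillFile) throws IOException {
    this.in = fs.open(spillFile);
  }

  /** Read one spill segment via pread; segments can be interleaved. */
  byte[] readSegment(long offset, int length) throws IOException {
    byte[] segment = new byte[length];
    in.readFully(offset, segment, 0, length);
    return segment;
  }

  void close() throws IOException { in.close(); }
}
{code}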


> Reducer inputs should be spilled to HDFS rather than local disk.
> ----------------------------------------------------------------
>
>                 Key: HADOOP-5299
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5299
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.19.0
>         Environment: All
>            Reporter: Milind Bhandarkar
>
> Currently, both map outputs and reduce inputs are stored on the local disks 
> of tasktrackers. (Un)availability of local disk space for intermediate data 
> is seen as a major factor in job failures. 
> The suggested solution is to store this intermediate data on HDFS (perhaps 
> with a replication factor of 1). However, the main blocker for that solution 
> is that the large number of temporary files (proportional to the total 
> number of maps) can overwhelm the namenode, especially since map outputs are 
> typically small (most produce a single-block output).
> Also, as we see in many applications, map output sizes can be estimated 
> fairly accurately, so users can plan ahead based on available local disk 
> space.
> However, reduce input sizes can vary a lot, especially for skewed data (or 
> because of bad partitioning).
> So, I suggest that it makes more sense to keep map outputs on local disks, 
> but the reduce inputs (when spilled from reducer memory) should go to HDFS.
> Adding a configuration variable to indicate the filesystem to be used for 
> reduce-side spills would let us experiment and compare the efficiency of 
> this new scheme (a sketch of such a knob follows below).
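
A minimal sketch of the configuration hook proposed above. The property name 
"mapred.reduce.spill.fs" is an assumption for illustration; no such key 
exists today:

{code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

class SpillFsResolver {
  /** Pick the filesystem for reduce-side spills; default to local disk,
   *  i.e. today's behaviour. The key name is hypothetical. */
  static FileSystem getSpillFs(Configuration conf) throws IOException {
    String uri = conf.get("mapred.reduce.spill.fs", "file:///");
    return FileSystem.get(URI.create(uri), conf);
  }
}
{code}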

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
