[
https://issues.apache.org/jira/browse/HADOOP-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675422#action_12675422
]
Milind Bhandarkar commented on HADOOP-5299:
-------------------------------------------
The reasoning behind this suggested change is as follows:
1. The number of reducers is typically an order of magnitude smaller than
the number of mappers, so the merged inputs created on HDFS will not
overwhelm the namenode.
2. Reduce input is spilled to HDFS only when it does not fit in reducer
memory (see the sketch after this list).
3. While a reducer is fetching map outputs, no user reduce code is
executing, so one can use a large buffer for reduce inputs, and spilling
to HDFS will therefore create large files (say 4-5 blocks each.) Moreover,
the number of spill files is roughly equal to the number of output files
that will be written, so it is proportional to what the namenode must
already accommodate.
4. While reducers are fetching data from mappers, they are network-bound
and waiting for maps to complete anyway, so the overhead of writing to
HDFS is negligible (it overlaps with the wait time for map outputs).
5. The terasort benchmark does not spill reduce input to disk, so this
scheme will have no adverse effect on that benchmark.
6. Since reducer inputs are available on HDFS, only those maps whose
outputs have not reached all the reducers need to be re-executed; a
reducer restarted on a different node will have access to the
already-spilled output on HDFS.
7. A speculative attempt for a reducer will not need to refetch all the
map outputs.
and
8. We can more easily pause a long-running low-priority job when a job
with higher priority is submitted, without losing much work and without
consuming RAM or local temporary disk space. (Thanks to Doug)
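To make this concrete, here is a minimal sketch of how a reduce-side spill
could be directed at a configurable filesystem. The property name
mapred.reduce.spill.fs and the class below are illustrative assumptions,
not existing Hadoop APIs:
{code:java}
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReduceSpillSketch {
  // Hypothetical property naming the filesystem for reduce-side spills;
  // defaulting to the local filesystem preserves current behavior.
  static final String SPILL_FS_KEY = "mapred.reduce.spill.fs";

  public static FSDataOutputStream createSpillFile(Configuration conf,
      Path spill) throws IOException {
    FileSystem fs =
        FileSystem.get(URI.create(conf.get(SPILL_FS_KEY, "file:///")), conf);
    // Replication factor 1: intermediate data does not need HDFS
    // durability guarantees, and a single replica keeps datanode and
    // namenode overhead low (see points 1 and 3 above).
    short replication = 1;
    return fs.create(spill,
        true,                                      // overwrite
        conf.getInt("io.file.buffer.size", 4096),  // buffer size
        replication,
        fs.getDefaultBlockSize());
  }
}
{code}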
> Reducer inputs should be spilled to HDFS rather than local disk.
> ----------------------------------------------------------------
>
> Key: HADOOP-5299
> URL: https://issues.apache.org/jira/browse/HADOOP-5299
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.19.0
> Environment: All
> Reporter: Milind Bhandarkar
>
> Currently, both map outputs and reduce inputs are stored on the local disks
> of tasktrackers. (Un)availability of local disk space for intermediate data
> is seen as a major factor in job failures.
> The suggested solution is to store this intermediate data on HDFS (maybe
> with a replication factor of 1). However, the main blocker for that
> solution is that lots of temporary names (proportional to the total number
> of maps) can overwhelm the namenode, especially since map outputs are
> typically small (most produce one block of output).
> Also, as we see in many applications, map output sizes can be estimated
> more accurately, so users can plan accordingly based on available local
> disk space.
> However, the reduce input sizes can vary a lot, especially for skewed data
> (or because of bad partitioning.)
> So, I suggest keeping map outputs on local disks, but sending the reduce
> inputs (when spilled from reducer memory) to HDFS.
> Adding a configuration variable to indicate the filesystem to be used for
> reduce-side spills would let us experiment and compare the efficiency of this
> new scheme.
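Along the lines of the configuration variable suggested above, an
experiment could toggle the spill filesystem per job. The property name
and the namenode address below are hypothetical:
{code:java}
import org.apache.hadoop.mapred.JobConf;

public class SpillFsExperiment {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // "file:///" reproduces the current local-disk behavior; an HDFS URI
    // routes reduce-side spills to HDFS for comparison.
    job.set("mapred.reduce.spill.fs",
        args.length > 0 ? args[0] : "hdfs://namenode:8020/");
    // ... configure mapper, reducer, input/output paths, and submit as
    // usual.
  }
}
{code}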