[
https://issues.apache.org/jira/browse/HADOOP-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675952#action_12675952
]
Owen O'Malley commented on HADOOP-5299:
---------------------------------------
I really don't think map/reduce should use an internal evolving api of hdfs.
That would completely break the possibility of upgrading hdfs and map/reduce
independently.
Writing intermediates to hdfs, as you know, is not and cannot be a performance
gain. Clearly the goal is to remove the task tracker's attempts at managing
local storage. So, I agree with Eric that this jira isn't well motivated without
the map outputs too. Unlike Eric, I don't think *either* is a good idea.
> Reducer inputs should be spilled to HDFS rather than local disk.
> ----------------------------------------------------------------
>
> Key: HADOOP-5299
> URL: https://issues.apache.org/jira/browse/HADOOP-5299
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.19.0
> Environment: All
> Reporter: Milind Bhandarkar
>
> Currently, both map outputs and reduce inputs are stored on the local disks of
> tasktrackers. (Un)availability of local disk space for intermediate data is
> seen as a major factor in job failures.
> The suggested solution is to store this intermediate data on HDFS (perhaps
> with a replication factor of 1). However, the main blocker with that solution
> is that a large number of temporary names (proportional to the total number
> of maps) can overwhelm the namenode, especially since map outputs are
> typically small (most produce a single block of output).
> Also, as we see in many applications, map output sizes can be estimated fairly
> accurately, so users can plan for them based on available local disk space.
> However, the reduce input sizes can vary a lot, especially for skewed data
> (or because of bad partitioning).
> So, I suggest that it makes more sense to keep map outputs on local disks,
> but to send the reduce inputs (when spilled from reducer memory) to HDFS.
> Adding a configuration variable to indicate the filesystem to be used for
> reduce-side spills would let us experiment with and compare the efficiency
> of the new scheme against the current one; a minimal sketch of such a hook
> follows below.
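
For illustration only, here is a minimal sketch (not part of any actual patch)
of how such a configuration hook might look. It assumes a hypothetical property
name "mapred.reduce.spill.fs" and uses the standard Hadoop FileSystem API:
spills default to the local filesystem, and when pointed at HDFS they are
written with a replication factor of 1 as suggested in the description.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SpillFsSketch {

  // Resolve the filesystem that reduce-side spills should be written to.
  // "mapred.reduce.spill.fs" is a hypothetical key used only for this sketch;
  // the default keeps the current behavior of spilling to local disk.
  static FileSystem getSpillFileSystem(Configuration conf) throws IOException {
    String spillFsUri = conf.get("mapred.reduce.spill.fs", "file:///");
    return FileSystem.get(URI.create(spillFsUri), conf);
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // A job experimenting with HDFS-backed spills could set, for example:
    // conf.set("mapred.reduce.spill.fs", "hdfs://namenode:9000/");
    FileSystem spillFs = getSpillFileSystem(conf);

    // Write one spill file with replication factor 1, per the description,
    // so HDFS-backed intermediates do not multiply the storage cost.
    Path spill = new Path("/tmp/spills/attempt_0001_r_000000/spill0.out");
    FSDataOutputStream out = spillFs.create(spill, (short) 1);
    out.writeBytes("spilled records would go here\n");
    out.close();
  }
}

Defaulting to file:/// keeps today's behavior, so the HDFS experiment could be
toggled purely through configuration.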