[
https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557477#action_12557477
]
eric baldeschwieler commented on HADOOP-2560:
---------------------------------------------
Nodes operate at different rates. Failures happen. When several jobs are
running, some nodes may not even become available in a timely manner. I
think a static approach will not deliver the desired performance gains while
also preserving reasonable throughput.
The current system takes full advantage of mapping tasks to nodes dynamically.
A static combination of splits will break all of this. One could perhaps do
something like what you suggest dynamically in the JobTracker (JT) when a
TaskTracker (TT) requests a new task. This might be a good compromise
implementation. It would also let you observe some global statistics on the
speed of maps and the size of their outputs, which would let you optimize
cluster sizes. Of course, doing this all dynamically on the TTs might use
fewer JT resources.
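To make the dynamic variant concrete, here is a minimal sketch of a scheduler that combines blocks only at the moment a node asks for work. It is a toy model, not Hadoop code: the class DynamicSplitAssigner, the method request_work, and the inputs block_locations and node_to_rack are all hypothetical names invented for illustration.

```python
class DynamicSplitAssigner:
    """Toy stand-in for the JT side: combines pending blocks when a TT asks for work."""

    def __init__(self, block_locations, node_to_rack):
        # block_locations: block id -> list of nodes holding a replica (hypothetical input)
        # node_to_rack: node name -> rack name (hypothetical input)
        self.block_locations = block_locations
        self.node_to_rack = node_to_rack
        self.pending = set(block_locations)

    def request_work(self, node, max_blocks):
        """Return up to max_blocks pending blocks for this node.

        Preference order: node-local blocks, then rack-local blocks, then any block.
        """
        rack = self.node_to_rack[node]

        def locality(block):
            nodes = self.block_locations[block]
            if node in nodes:
                return 0  # a replica lives on the requesting node
            if any(self.node_to_rack[n] == rack for n in nodes):
                return 1  # a replica lives on the same rack
            return 2      # remote block

        # Sort by locality first, then by block id to keep the result deterministic.
        chosen = sorted(self.pending, key=lambda b: (locality(b), b))[:max_blocks]
        self.pending.difference_update(chosen)
        return chosen
```

Because blocks are grouped per request, a slow or failed node simply stops asking and its share goes to other nodes. The max_blocks argument is where observed map speed or output size could feed in, though how to set it is left open here.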
> Combining multiple input blocks into one mapper
> -----------------------------------------------
>
> Key: HADOOP-2560
> URL: https://issues.apache.org/jira/browse/HADOOP-2560
> Project: Hadoop
> Issue Type: Bug
> Reporter: Runping Qi
>
> Currently, an input split contains a consecutive chunk of an input file, which
> by default corresponds to a DFS block.
> This may lead to a large number of mapper tasks if the input data is large.
> This leads to the following problems:
> 1. Shuffling cost: since the framework has to move M * R map output segments
> to the nodes running reducers,
> larger M means larger shuffling cost.
> 2. High JVM initialization overhead
> 3. Disk fragmentation: larger number of map output files means lower read
> throughput for accessing them.
> Ideally, you want to keep the number of mappers to no more than 16 times the
> number of nodes in the cluster.
> To achieve that, we can increase the input split size. However, if a split
> spans more than one DFS block,
> you lose the data locality scheduling benefits.
> One way to address this problem is to combine multiple input blocks on the
> same rack into one split.
> If on average we combine B blocks into one split, then we will reduce the
> number of mappers by a factor of B.
> Since all the blocks for one mapper share a rack, we can still benefit from
> rack-aware scheduling.
> Thoughts?
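
The static proposal in the description can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the eventual implementation: each block is treated as living on exactly one rack (real blocks have replicas on several), and the function names combine_blocks_by_rack and shuffle_segments are made up for this example.

```python
from collections import defaultdict


def combine_blocks_by_rack(block_racks, blocks_per_split):
    """Group blocks into splits of at most blocks_per_split blocks sharing a rack.

    block_racks: list of (block_id, rack) pairs. Assumes one rack per block,
    which ignores replication and is only meant to show the grouping idea.
    """
    by_rack = defaultdict(list)
    for block_id, rack in block_racks:
        by_rack[rack].append(block_id)

    splits = []
    for rack in sorted(by_rack):
        blocks = by_rack[rack]
        # Cut each rack's block list into chunks of blocks_per_split.
        for i in range(0, len(blocks), blocks_per_split):
            splits.append((rack, blocks[i:i + blocks_per_split]))
    return splits


def shuffle_segments(num_mappers, num_reducers):
    """Number of map output segments the shuffle has to move (M * R)."""
    return num_mappers * num_reducers
```

With 8 blocks spread evenly over two racks and B = 4, this yields 2 splits instead of 8, so for 10 reducers the shuffle moves 20 segments rather than 80.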