[
https://issues.apache.org/jira/browse/FLINK-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241300#comment-14241300
]
ASF GitHub Bot commented on FLINK-1287:
---------------------------------------
GitHub user fhueske opened a pull request:
https://github.com/apache/incubator-flink/pull/258
[FLINK-1287] LocalizableSplitAssigner prefers splits with less degrees of
freedom
The current LocalizableSplitAssigner assigns remote and local splits
without priorities, i.e., each remote (local) split has the same probability of
being assigned.
With this change, splits are prioritised by the number of hosts they can be
locally read from. Splits with fewer options to be locally read are preferred
over others that can be locally accessed from many hosts.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/fhueske/incubator-flink splitAssignment
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-flink/pull/258.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #258
----
commit 8900b91dc50a55d1149482c3f9c10dc3eaa5048c
Author: Fabian Hueske <[email protected]>
Date: 2014-12-07T21:28:22Z
[FLINK-1287] LocalizableSplitAssigner prefers splits with less degrees of
freedom
----
> Improve File Input Split assignment
> -----------------------------------
>
> Key: FLINK-1287
> URL: https://issues.apache.org/jira/browse/FLINK-1287
> Project: Flink
> Issue Type: Improvement
> Components: Local Runtime
> Reporter: Robert Metzger
> Assignee: Fabian Hueske
>
> While running some DFS read-intensive benchmarks, I found that the assignment
> of input splits is not optimal. In particular in cases where the numWorker !=
> numDataNodes and when the replication factor is low (in my case it was 1).
> In the particular example, the input had 40960 splits, of which 4694 were
> read remotely. Spark did only 2056 remote reads for the same dataset.
> With the replication factor increased to 2, Flink did only 290 remote reads.
> So usually, users shouldn't be affected by this issue.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)