Robert Metzger created FLINK-1287:
-------------------------------------

             Summary: Improve File Input Split assignment
                 Key: FLINK-1287
                 URL: https://issues.apache.org/jira/browse/FLINK-1287
             Project: Flink
          Issue Type: Improvement
          Components: Local Runtime
            Reporter: Robert Metzger

While running some DFS read-intensive benchmarks, I found that the assignment of input splits is not optimal, in particular when numWorker != numDataNodes and the replication factor is low (in my case it was 1).

In this example, the input had 40960 splits, of which Flink read 4694 remotely. Spark did only 2056 remote reads for the same dataset.

With the replication factor increased to 2, Flink did only 290 remote reads, so users running with a higher replication factor usually shouldn't be affected by this issue.
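
To put the reported counts in proportion, here is a small sketch that converts them into remote-read shares of the 40960 splits. The class and method names (RemoteReadShare, remoteReadPercent) are made up for illustration and are not part of Flink; only the counts come from the report above.

```java
import java.util.Locale;

// Hypothetical helper, not part of Flink: turns the remote-read counts
// from the report into a percentage of all input splits.
class RemoteReadShare {

    // Share of splits that were read remotely, in percent.
    static double remoteReadPercent(int remoteReads, int totalSplits) {
        if (totalSplits <= 0) {
            throw new IllegalArgumentException("totalSplits must be positive");
        }
        return 100.0 * remoteReads / totalSplits;
    }

    public static void main(String[] args) {
        int totalSplits = 40960; // number of input splits in the benchmark

        // Counts as reported: Flink and Spark at replication 1, Flink at replication 2.
        String[] labels = {"Flink, replication 1", "Spark, replication 1", "Flink, replication 2"};
        int[] remoteReads = {4694, 2056, 290};

        for (int i = 0; i < labels.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%-22s %5d remote reads (%.2f%%)",
                    labels[i], remoteReads[i], remoteReadPercent(remoteReads[i], totalSplits)));
        }
    }
}
```

With a replication factor of 1, roughly 11.5% of the splits were read remotely by Flink versus about 5% for Spark; with a replication factor of 2, Flink's share drops below 1%.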