Hi,

We're trying to distribute batch input data across N HDFS files, partitioned by 
hash, using the DataSet API. What I'm doing looks roughly like this:

env.createInput(…)
      .partitionByHash(0)
      .setParallelism(N)
      .output(…)
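
For context, here is a minimal self-contained sketch of that pipeline (Scala 
DataSet API; the CSV input, paths, and key field are just placeholders for our 
real formats):

import org.apache.flink.api.scala._

object PartitionToNFiles {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val N = 100000 // desired number of output files (100K in our case)

    // Placeholder input: (key, payload) tuples from a CSV file; the real job
    // uses a custom InputFormat via env.createInput(...).
    val records = env.readCsvFile[(Int, String)]("hdfs:///tmp/input")

    // Hash-partition on the key (field 0) and write out. The sink's
    // parallelism determines how many files land on HDFS, so producing
    // N files currently means running the partition/sink at parallelism N.
    records
      .partitionByHash(0)
      .setParallelism(N)
      .writeAsCsv("hdfs:///tmp/output")
      .setParallelism(N)

    env.execute("partition to N files")
  }
}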

This works well for a small number of files. But when we need to distribute to a 
large number of files (say 100K), the parallelism becomes too large and we 
cannot afford that many TMs.

In Spark we can write something like 'rdd.partitionBy(N)' and control the 
parallelism separately (using dynamic allocation); see the sketch below. Is there 
anything similar in Flink, or another way to achieve a similar result? Thank you!
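
For reference, this is roughly the Spark pattern I have in mind (a minimal Scala 
sketch with placeholder paths and key extraction; in the Scala API partitionBy 
takes a Partitioner such as a HashPartitioner rather than a plain count):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object SparkPartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-to-N-files"))
    val N = 100000 // desired number of output files

    // Placeholder key/value extraction to get a pair RDD.
    val records = sc.textFile("hdfs:///tmp/input").map { line =>
      val Array(key, payload) = line.split(",", 2)
      (key, payload)
    }

    // HashPartitioner(N) fixes the number of partitions, and saveAsTextFile
    // writes one file per partition, so we always get N files. How many of
    // those partitions run concurrently is governed by the number of
    // executors, which dynamic allocation scales independently of N.
    records
      .partitionBy(new HashPartitioner(N))
      .saveAsTextFile("hdfs:///tmp/output")

    sc.stop()
  }
}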

Qi
