Hi,

We’re trying to distribute batch input data across N HDFS files, partitioned by hash, using the DataSet API. What I’m doing is roughly:
env.createInput(…)
   .partitionByHash(0)
   .setParallelism(N)
   .output(…)

This works well for a small number of files. But when we need to distribute to a large number of files (say 100K), the parallelism becomes too large and we cannot afford that many TaskManagers. In Spark we can write something like ‘rdd.partitionBy(N)’ and control the parallelism separately (using dynamic allocation). Is there anything similar in Flink, or another way to achieve a similar result?

Thank you!
Qi
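To make the decoupling we’re after concrete, here is a minimal plain-Java sketch (not Flink API; class name, N, M, and the sample records are all hypothetical): records are hashed into N buckets, but only M writer tasks handle them, each writer owning several buckets, so parallelism stays at M no matter how large N is.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BucketSketch {
    public static void main(String[] args) {
        int N = 8; // number of output buckets/files (could be 100K)
        int M = 2; // actual parallelism (writer tasks), M << N

        List<String> records = List.of("a", "b", "c", "d", "e", "f");

        // Step 1: assign each record to one of N buckets by hashing its key.
        Map<Integer, List<String>> buckets = records.stream()
            .collect(Collectors.groupingBy(r -> Math.floorMod(r.hashCode(), N)));

        // Step 2: route buckets to M writers; each writer handles roughly N/M
        // buckets, so the number of parallel tasks is M regardless of N.
        Map<Integer, List<Integer>> writerToBuckets = buckets.keySet().stream()
            .collect(Collectors.groupingBy(b -> b % M));

        writerToBuckets.forEach((writer, bs) ->
            System.out.println("writer " + writer + " handles buckets " + bs));
    }
}
```

This is essentially what Spark’s ‘rdd.partitionBy(N)’ plus dynamic allocation gives us: the bucket count N and the task count M are independent knobs.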