[ https://issues.apache.org/jira/browse/BEAM-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dawid Wysakowicz reassigned BEAM-2873: -------------------------------------- Assignee: Dawid Wysakowicz > Detect number of shards for file sink in Flink Streaming Runner > --------------------------------------------------------------- > > Key: BEAM-2873 > URL: https://issues.apache.org/jira/browse/BEAM-2873 > Project: Beam > Issue Type: Improvement > Components: runner-flink > Reporter: Aljoscha Krettek > Assignee: Dawid Wysakowicz > Priority: Major > > [~reuvenlax] mentioned that this is done for the Dataflow Runner and the > default behaviour on Flink can be somewhat surprising for users. > ML entry: https://www.mail-archive.com/dev@beam.apache.org/msg02665.html: > This is how the file sink has always worked in Beam. If no sharding is > specified, then this means runner-determined sharding, and by default that is > one file per bundle. If Flink has small bundles, then I suggest using the > withNumShards method to explicitly pick the number of output shards. > The Flink runner can detect that runner-determined sharding has been chosen, > and override it with a specific number of shards. For example, the Dataflow > streaming runner (which as you mentioned also has small bundles) detects this > case and sets the number of out files shards based on the number of workers > in the worker pool > [Here|https://github.com/apache/beam/blob/9e6530adb00669b7cf0f01cb8b128be0a21fd721/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L354] > is the code that does this; it should be quite simple to do something > similar for Flink, and then there will be no need for users to explicitly > call withNumShards themselves. -- This message was sent by Atlassian JIRA (v7.6.3#76005)