What you describe is what happens (at least in the Dataflow runner) when auto sharding is specified in batch. This mechanism tries to split the PCollection to fully utilize every worker, so it is not appropriate when a fixed number of shards is desired. A GroupByKey is also necessary in streaming in order to split an unbounded PCollection using windows/triggers, as windows and triggers are applied during GroupByKey.
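To illustrate why a fixed shard count implies a grouping step, here is a minimal sketch in plain Python (not Beam code, and the function name `write_with_fixed_shards` is hypothetical): each element is assigned a random shard key, and a group-by-key then collects each shard's elements so that exactly one writer produces one file per shard. Without the grouping, each worker bundle would write its own file independently and the total file count could not be bounded.

```python
import random
from collections import defaultdict

def write_with_fixed_shards(elements, num_shards):
    """Sketch of fixed sharding: random shard key + group-by-key."""
    grouped = defaultdict(list)
    for element in elements:
        # Analogue of assigning a sharding key to each element.
        shard = random.randrange(num_shards)
        # Analogue of GroupByKey: collect all elements per shard.
        grouped[shard].append(element)
    # Exactly num_shards "files", even if some shards are empty.
    return {shard: grouped[shard] for shard in range(num_shards)}

files = write_with_fixed_shards(["a", "b", "c", "d", "e"], num_shards=2)
```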
On Fri, May 21, 2021 at 4:16 PM Tao Li <t...@zillow.com> wrote:

> Hi Beam community,
>
> I wonder why a GroupBy operation is involved in WriteFiles:
> https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html
>
> This doc mentions: "The exact parallelism of the write stage can be
> controlled using withNumShards(int)
> <https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html#withNumShards-int->,
> typically used to control how many files are produced or to globally limit
> the number of workers connecting to an external service. However, this
> option can often hurt performance: it adds an additional GroupByKey
> <https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/transforms/GroupByKey.html>
> to the pipeline."
>
> When we are saving the PCollection into multiple files, why can't we
> simply split the PCollection without a key and save each split as a file?
>
> Thanks!