What you describe is what happens (at least in the Dataflow runner) when auto sharding is specified in batch. This mechanism tries to split the PCollection to fully utilize every worker, so it is not appropriate when a fixed number of shards is desired. A GroupByKey is also necessary in streaming in order to split an unbounded PCollection using windows/triggers, as windows and triggers are applied during GroupByKey.
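To illustrate why a fixed shard count implies a grouping step, here is a minimal sketch in plain Python (not Beam code, and the function name `write_with_fixed_shards` is hypothetical): each element is assigned a random shard key, and a group-by-key then collects each shard's elements so that exactly one writer produces one file per shard. Without the grouping, each worker bundle would write its own file independently and the total file count could not be bounded.

```python
import random
from collections import defaultdict

def write_with_fixed_shards(elements, num_shards):
    """Sketch of fixed sharding: random shard key + group-by-key."""
    grouped = defaultdict(list)
    for element in elements:
        # Analogue of assigning a sharding key to each element.
        shard = random.randrange(num_shards)
        # Analogue of GroupByKey: collect all elements per shard.
        grouped[shard].append(element)
    # Exactly num_shards "files", even if some shards are empty.
    return {shard: grouped[shard] for shard in range(num_shards)}

files = write_with_fixed_shards(["a", "b", "c", "d", "e"], num_shards=2)
```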
On Fri, May 21, 2021 at 4:16 PM Tao Li <t...@zillow.com> wrote:

> Hi Beam community,
>
> I wonder why a GroupBy operation is involved in WriteFiles:
> https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html
>
> This doc mentions: "The exact parallelism of the write stage can be
> controlled using withNumShards(int)
> <https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/WriteFiles.html#withNumShards-int->,
> typically used to control how many files are produced or to globally limit
> the number of workers connecting to an external service. However, this
> option can often hurt performance: it adds an additional GroupByKey
> <https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/transforms/GroupByKey.html>
> to the pipeline."
>
> When we are saving the PCollection into multiple files, why can't we
> simply split the PCollection without a key and save each split as a file?
>
> Thanks!