Thank you for the clarification! Is there a way to control the number of
shards, i.e. the bundle? I know in pure Beam IO connectors, we have
num_shards supported, for example, WriteToParquet
Sharding is determined by the distribution of work. Each worker writes to
its own shard, and in the case of dynamic partitioning, etc. workers may
end up processing more than one "bundle" of items and hence produce more
than one shard. See also
https://beam.apache.org/documentation/runtime/model/
Hi team,
I have another question when using Beam Dataframe IO connector. I tried
to_parquet, and my data are written to several different files. I am
wondering how I can control the number of files (shards) or how the
sharding is done for to_parquet and other Beam Dataframe IO APIs?
Thank you!