Sharding is determined by how the work is distributed. Each worker writes to
its own shard, and in the case of dynamic partitioning and similar features,
a worker may end up processing more than one "bundle" of items and hence
produce more than one shard. See also
https://beam.apache.org/documentation/runtime/model/
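As a rough illustration of the point above, here is a toy simulation (plain Python, not Beam itself) of how bundles map to output shards: each bundle a worker processes becomes one shard file, and the file names below merely mimic Beam's "prefix-SSSSS-of-NNNNN" naming convention.

```python
# Toy simulation: assume each "bundle" of elements handed to a worker
# produces exactly one output shard. This is an illustration of the
# sharding behavior described above, not actual Beam code.

def shard_names(num_bundles, prefix="output", suffix=".parquet"):
    """Return the shard file names that num_bundles bundles would produce."""
    return [
        f"{prefix}-{i:05d}-of-{num_bundles:05d}{suffix}"
        for i in range(num_bundles)
    ]

# Three bundles (e.g. three workers, or one worker handed three bundles
# by dynamic partitioning) yield three shard files:
print(shard_names(3))
# ['output-00000-of-00003.parquet',
#  'output-00001-of-00003.parquet',
#  'output-00002-of-00003.parquet']
```

If a fixed number of output files is required, one possible route (a sketch, not something the DataFrame API exposes directly as far as I know) is to convert the deferred DataFrame to a PCollection with apache_beam.dataframe.convert.to_pcollection and write it with apache_beam.io.parquetio.WriteToParquet, which accepts a num_shards parameter.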
Hi team,
I have another question about using the Beam DataFrame IO connector. I tried
to_parquet, and my data were written to several different files. I am
wondering how I can control the number of files (shards), or how the
sharding is done, for to_parquet and the other Beam DataFrame IO APIs?
Thank you!