pabloem commented on PR #25173: URL: https://github.com/apache/beam/pull/25173#issuecomment-1404398659
Autosharding is a particular capability of the Dataflow runner. See: https://cloud.google.com/blog/products/data-analytics/3x-dataflow-throughput-auto-sharding-bigquery It alows us to dynamically grow the number of individual shards by the number of streaming workers, instead of having a fixed number of shards, which is the case for other runners. The outcome is that - Without autosharding: for a given key X, all of the elements for that key will end up processed serially (usually by the same worker) - With autosharding, GroupIntoBatches uses a ShardedKey where, for a key X, elements for that key will be batched into parallel shards that grow proportionally to the number of workers -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
