Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-14 Thread Wenbing Bai
Thank you for the clarification! Is there a way to control the number of shards, i.e. the bundle? I know in pure Beam IO connectors, we have num_shards supported, for example, WriteToParquet

Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Robert Bradshaw
Sharding is determined by the distribution of work. Each worker writes to its own shard, and in the case of dynamic partitioning, etc. workers may end up processing more than one "bundle" of items and hence produce more than one shard. See also https://beam.apache.org/documentation/runtime/model/

Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Wenbing Bai
Hi team, I have another question when using Beam Dataframe IO connector. I tried to_parquet, and my data are written to several different files. I am wondering how I can control the number of files (shards) or how the sharding is done for to_parquet and other Beam Dataframe IO APIs? Thank you! We

Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-11 Thread Kenneth Knowles
+dev In the Beam Java ecosystem, this functionality is provided by the Sorter library (https://beam.apache.org/documentation/sdks/java-extensions/#sorter). I'm curious what people think about various options: - Python version of the transform(s) - Expose sorter as xlang transform(s) - Conveni