Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-14 Thread Wenbing Bai
Thank you for the clarification! Is there a way to control the number of shards, i.e. the number of bundles? I know the pure Beam IO connectors support num_shards, for example WriteToParquet

Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Robert Bradshaw
Sharding is determined by the distribution of work. Each worker writes to its own shard, and with dynamic partitioning and the like, a worker may end up processing more than one "bundle" of items and hence produce more than one shard. See also https://beam.apache.org/documentation/runtime/model/

Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Wenbing Bai
Hi team, I have another question about using the Beam DataFrame IO connectors. I tried to_parquet, and my data is written to several different files. How can I control the number of files (shards), and how is the sharding done for to_parquet and the other Beam DataFrame IO APIs? Thank you! We

Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-10 Thread Wenbing Bai
Hi Robert and Brian, I don't know why I didn't catch your replies. But thank you so much for looking at this. My parquet files will be consumed by downstream processes which require data points with the same "key1" to be sorted by "key2". The downstream process, for example, will make a
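To make the requirement concrete, here is a plain-Python sketch (no Beam involved) of the layout described above: records with the same "key1" are kept together, and each group is ordered by "key2". The function name and the sample records are hypothetical.

```python
from collections import defaultdict
from operator import itemgetter


def group_and_sort(records):
    """Group records by key1 and sort each group by key2."""
    groups = defaultdict(list)
    for record in records:
        groups[record["key1"]].append(record)
    # Sort within each group only; order across groups is not significant.
    return {
        key: sorted(group, key=itemgetter("key2"))
        for key, group in groups.items()
    }


# Hypothetical sample data.
records = [
    {"key1": "a", "key2": 3},
    {"key1": "b", "key2": 1},
    {"key1": "a", "key2": 1},
]
grouped = group_and_sort(records)
```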

Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-20 Thread Robert Bradshaw
It would also be helpful to understand what your overall objective is with this output. Is there a reason you need it sorted/partitioned in a certain way? On Tue, Apr 20, 2021 at 4:51 PM Brian Hulette wrote: > Hi Wenbing, > Sorry for taking so long to get back to you on this. > I discussed this

Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-20 Thread Brian Hulette
Hi Wenbing, Sorry for taking so long to get back to you on this. I discussed this with Robert offline and we came up with a potential workaround - you could try writing out the Parquet file from within the groupby.apply method. You can use beam's FileSystems abstraction to open a Python file object

Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-07 Thread Wenbing Bai
Thank you, Brian. I tried `partition_cols`, but it is not working. It does work in pure pandas, so I am not sure whether something is wrong in Beam. Wenbing On Wed, Apr 7, 2021 at 2:56 PM Brian Hulette wrote: > Hm, to_parquet does have a `partition_cols` argument [1] which we pass > through [2].