Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-14 Thread Wenbing Bai
Thank you for the clarification! Is there a way to control the number of shards, i.e. the bundle? I know in pure Beam IO connectors, we have num_shards supported, for example, WriteToParquet

Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Robert Bradshaw
Sharding is determined by the distribution of work. Each worker writes to its own shard, and in the case of dynamic partitioning, etc. workers may end up processing more than one "bundle" of items and hence produce more than one shard. See also https://beam.apache.org/documentation/runtime/model/

Re: [EXT] Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-13 Thread Wenbing Bai
Hi team, I have another question when using Beam Dataframe IO connector. I tried to_parquet, and my data are written to several different files. I am wondering how I can control the number of files (shards) or how the sharding is done for to_parquet and other Beam Dataframe IO APIs? Thank you! We

Re: [EXT] Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-05-10 Thread Wenbing Bai
Hi Robert and Brian, I don't know why I didn't catch your replies. But thank you so much for looking at this. My parquet files will be consumed by downstreaming processes which require data points with the same "key1" that are sorted by "key2". The downstreaming process, for example, will make a

Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-20 Thread Robert Bradshaw
It would also be helpful to understand what your overall objective is with this output. Is there a reason you need it sorted/partitioned in a certain way? On Tue, Apr 20, 2021 at 4:51 PM Brian Hulette wrote: > Hi Wenbing, > Sorry for taking so long to get back to you on this. > I discussed this

Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-20 Thread Brian Hulette
Hi Wenbing, Sorry for taking so long to get back to you on this. I discussed this with Robert offline and we came up with a potential workaround - you could try writing out the Parquet file from within the groupby.apply method. You can use beam's FileSystems abstraction to open a Python file object

Re: [EXT] Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-07 Thread Wenbing Bai
Thank you, Brian. I tried `partition_cols`, but it is not working. I tried pure pandas, it does work, so I am not sure if anything wrong with Beam. Wenbing On Wed, Apr 7, 2021 at 2:56 PM Brian Hulette wrote: > Hm, to_parquet does have a `partition_cols` argument [1] which we pass > through [2].

Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-07 Thread Brian Hulette
Hm, to_parquet does have a `partition_cols` argument [1] which we pass through [2]. It would be interesting to see what `partition_cols='key1'` does - I suspect it won't work perfectly though. Do you have any thoughts here Robert? [1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pa

Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-07 Thread Wenbing Bai
Hi Robert and Brian, I tried groupby in my case. Here is my pipeline code. I do see all the data in the final parquet file are sorted in each group. However, I'd like to write each partition (group) to an individual file, how can I achieve it? In addition, I am using the master of Apache Beam SDK,

Re: [EXT] Re: Beam Dataframe - sort and grouping

2021-04-02 Thread Wenbing Bai
Thank you, Robert and Brian. I'd like to try this out. I am trying to distribute my dataset to nodes, sort each partition by some key and then store each partition to its own file. Wenbing On Fri, Apr 2, 2021 at 9:23 AM Brian Hulette wrote: > Note groupby.apply [1] in particular should be able

Re: Beam Dataframe - sort and grouping

2021-04-02 Thread Brian Hulette
Note groupby.apply [1] in particular should be able to do what you want, something like: df.groupby('key1').apply(lambda df: df.sort_values('key2')) But as Robert noted we don't make any guarantees about preserving this ordering later in the pipeline. For this reason I actually just sent a PR t

Re: Beam Dataframe - sort and grouping

2021-04-02 Thread Robert Bradshaw
Thanks for trying this out. Better support for groupby (e.g. https://github.com/apache/beam/pull/13843 , https://github.com/apache/beam/pull/13637) will be available in the next Beam release (2.29, in progress, but you could try out head if you want). Note, however, that Beam PCollections are by d

Beam Dataframe - sort and grouping

2021-04-01 Thread Wenbing Bai
Hi Beam users, I have a user case to partition my PCollection by some key, and then sort my rows within the same partition by some other key. I feel Beam Dataframe could be a candidate solution, but I cannot figure out how to make it work. Specifically, I tried df.groupby where I expect my data w