Re: Group rows in a stream of record batches by group id?

2023-06-14 Thread Haocheng Liu
> e, after the table has been created in its entirety, that could lead to a large memory footprint in that use case. Thanks!

RE: Group rows in a stream of record batches by group id?

2023-06-14 Thread Jerry Adair
nce, after the table has been created in its entirety, that could lead to a large memory footprint in that use case. Thanks!

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
(Admittedly, the PR title of [1] doesn't reflect that only the scalar aggregate UDF is implemented and not the hash one; that was an oversight on my part, sorry.)

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
Thanks Weston. I think I found what you pointed out to me before, which is this bit of code: https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/partition.cc#L118 I will see if I can adapt this to be used in a streaming situation. > I know you recently added [1] and I'm maybe a little un
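
For the streaming adaptation, the main difference from the one-shot partitioning code linked above is that the key-to-group-id map has to persist across batches. A minimal sketch of that state in Python (illustrative names, not code from the thread; the real C++ Grouper hashes rows natively rather than round-tripping through Python objects):

    import pyarrow as pa

    class StreamingGrouper:
        """Assigns stable dense group ids to key values across many batches."""

        def __init__(self):
            self._ids = {}  # key value -> group id, grows as new keys arrive

        def consume(self, keys: pa.Array) -> pa.Array:
            # Look up each key, minting a new id the first time it is seen;
            # ids stay consistent from one batch to the next.
            ids = [self._ids.setdefault(k, len(self._ids))
                   for k in keys.to_pylist()]
            return pa.array(ids, type=pa.uint32())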

Re: Group rows in a stream of record batches by group id?

2023-06-13 Thread Weston Pace
Are you looking for something in C++ or Python? We have a thing called the "grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h) which (if memory serves) is the heart of the functionality in C++. It would be nice to add some Python bindings for this functionality as this ask comes up
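
Since arrow::compute::Grouper has no Python bindings yet, a rough single-key stand-in in Python is dictionary encoding, whose codes are dense ids assigned in order of first appearance, much like the group ids a grouper produces (a sketch under that assumption; assign_group_ids is an illustrative name):

    import pyarrow as pa
    import pyarrow.compute as pc

    def assign_group_ids(keys: pa.Array) -> pa.Array:
        # Dictionary codes double as dense group ids for a single key column.
        encoded = pc.dictionary_encode(keys)
        return encoded.indices.cast(pa.uint32())

    print(assign_group_ids(pa.array(["a", "b", "a", "c", "b"])))
    # -> [0, 1, 0, 2, 1]

Unlike the C++ Grouper, this is one-shot: it keeps no state, so ids are not stable across separately encoded batches.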

Group rows in a stream of record batches by group id?

2023-06-13 Thread Li Jin
Hi, I am trying to write a function that takes a stream of record batches (where the last column is the group id) and produces k record batches, where record batch k_i contains all the rows with group id == i. Pseudocode is something like: def group_rows(batches, k) -> array[RecordBatch] { builder
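
A minimal runnable sketch of such a function, assuming pyarrow, group ids densely numbered 0..k-1 in the last column, and returning one list of matching batches per group rather than a single concatenated batch per group (names are illustrative, not from the thread):

    import pyarrow as pa
    import pyarrow.compute as pc

    def group_rows(batches, k):
        # Bucket i collects the row slices whose group id equals i.
        buckets = [[] for _ in range(k)]
        for batch in batches:
            group_ids = batch.column(batch.num_columns - 1)
            for i in range(k):
                selected = batch.filter(pc.equal(group_ids, i))
                if selected.num_rows:
                    buckets[i].append(selected)
        return buckets

Note that this filters each batch k times; the grouper/partitioning code discussed above does a single pass per batch, which is the point of the question.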