e, after the table has
> been created in its entirety, that could lead to a large memory footprint
> in that use case.
>
> Thanks!
>
>
> -----Original Message-----
> From: Weston Pace
> Sent: Tuesday, June 13, 2023 2:11 PM
> To: dev@arrow.apache.org
> Subject: Re: Group rows in a stream of record batches by grou
(Admittedly, PR title of [1] doesn't reflect that only the scalar aggregate
UDF is implemented and not the hash one - that is an oversight on my part -
sorry)
On Tue, Jun 13, 2023 at 3:51 PM Li Jin wrote:
> Thanks Weston.
>
> I think I found what you pointed out to me before which is this bit of
Thanks Weston.
I think I found what you pointed out to me before which is this bit of code:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/partition.cc#L118
I will see if I can adapt this for use in a streaming situation.
> I know you recently added [1] and I'm maybe a little un
Are you looking for something in C++ or Python? We have a thing called the
"grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h) which
(if memory serves) is the heart of the functionality in C++. It would be
nice to add some Python bindings for this functionality as this ask comes
up
Hi,
I am trying to write a function that takes a stream of record batches
(where the last column is group id), and produces k record batches, where
record batches k_i contain all the rows with group id == i.
Pseudocode is something like:
def group_rows(batches, k) -> array[RecordBatch] {
builder