Are you looking for something in C++ or Python?  We have a thing called the
"grouper" (arrow::compute::Grouper in arrow/compute/row/grouper.h) which
(if memory serves) is the heart of this functionality in C++.  It would be
nice to add Python bindings for it, as this ask comes up from pyarrow users
pretty regularly.

The grouper is used in src/arrow/dataset/partition.h to partition a record
batch into one batch per group.  This is how the dataset writer writes a
partitioned dataset, and it's a good example of how you would use the
grouper for a "one batch in, one batch per group out" use case.
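To make the flow concrete, here is a plain-Python sketch of that three-step pattern.  The function names mirror the Grouper's steps (consume the keys to get dense group ids, build groupings of row indices, then take the rows per group), but this is illustrative pseudocode in Python, not the actual C++ signatures:

```python
def consume(keys):
    """Assign a dense group id to each row's key, in first-seen order."""
    id_of = {}
    group_ids = []
    for key in keys:
        if key not in id_of:
            id_of[key] = len(id_of)
        group_ids.append(id_of[key])
    return group_ids, len(id_of)

def make_groupings(group_ids, num_groups):
    """Collect the row indices belonging to each group id."""
    groupings = [[] for _ in range(num_groups)]
    for row, gid in enumerate(group_ids):
        groupings[gid].append(row)
    return groupings

def apply_groupings(groupings, batch):
    """Take the rows of `batch` (a dict of equal-length columns) per group."""
    return [
        {name: [col[i] for i in rows] for name, col in batch.items()}
        for rows in groupings
    ]

batch = {"part": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]}
group_ids, num_groups = consume(batch["part"])
per_group = apply_groupings(make_groupings(group_ids, num_groups), batch)
# per_group[0] == {"part": ["a", "a"], "value": [1, 3]}
```

In the real C++ code the "take" in the last step is vectorized rather than row-by-row, but the shape of the computation is the same.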

The grouper can also be used in a streaming situation (many batches in, one
batch per group out).  In fact, the grouper is what the group by node uses
under the hood.  I know you recently added [1], and I'm a little uncertain
what the difference is between this ask and the capabilities added in [1].

[1] https://github.com/apache/arrow/pull/35514
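The streaming case differs from the single-batch case in that the grouper's key table persists across batches, so a key seen in batch 1 and again in batch 5 maps to the same group id.  A minimal sketch of that idea in plain Python (the class and method names are illustrative, not the real API):

```python
class StreamingGrouper:
    """Accumulate rows from many batches into one batch per group."""

    def __init__(self):
        self.id_of = {}   # key -> dense group id, shared across all batches
        self.groups = []  # per-group accumulated columns

    def consume(self, batch, key_column):
        """batch is a dict of equal-length columns; key_column names the key."""
        for row in range(len(batch[key_column])):
            key = batch[key_column][row]
            gid = self.id_of.setdefault(key, len(self.id_of))
            if gid == len(self.groups):  # first time we see this key
                self.groups.append({name: [] for name in batch})
            for name, col in batch.items():
                self.groups[gid][name].append(col[row])

    def finish(self):
        return self.groups  # one "batch" (dict of columns) per group

g = StreamingGrouper()
g.consume({"part": ["a", "b"], "value": [1, 2]}, key_column="part")
g.consume({"part": ["b", "a"], "value": [3, 4]}, key_column="part")
# g.finish() == [{"part": ["a", "a"], "value": [1, 4]},
#                {"part": ["b", "b"], "value": [2, 3]}]
```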

On Tue, Jun 13, 2023 at 8:23 AM Li Jin <ice.xell...@gmail.com> wrote:

> Hi,
>
> I am trying to write a function that takes a stream of record batches
> (where the last column is group id), and produces k record batches, where
> record batches k_i contain all the rows with group id == i.
>
> Pseudocode is something like:
>
> def group_rows(batches, k) -> array[RecordBatch] {
>       builders = array[RecordBatchBuilder](k)
>       for batch in batches:
>            # Assuming last column is the group id
>            group_ids = batch.column(-1)
>            for i in batch.num_rows():
>                 k_i = group_ids[i]
>                 builders[k_i].append(batch[i])
>
>        batches = array[RecordBatch](k)
>        for i in range(k):
>            batches[i] = builders[i].build()
>        return batches
> }
>
> I wonder if there is some existing code that does something like this?
> (Specifically, I didn't find code that can append a row or rows to a
> RecordBatchBuilder, either one row given a row index or multiple rows
> given a list of row indices.)
>
