bkietz opened a new pull request #9621:
URL: https://github.com/apache/arrow/pull/9621
In order to keep this patch simpler, the execution framework for scalar
aggregate kernels is reused for grouped aggregations. This is not intended as a
permanent arrangement.
A `compute::Function` is added which implements grouped aggregation.
`GroupByOptions::aggregates` is a vector specifying which
aggregations will be performed: each element is a
`GroupByOptions::Aggregate` containing the name of an aggregate
function and a pointer to a `FunctionOptions`. The first arguments to
`group_by` are interpreted as the corresponding aggregands and the remainder
will be used as grouping keys. The output will be a struct array with the same
number of fields as there are arguments, where each slot contains the
aggregation results and keys for one group:
```c++
GroupByOptions options{
    {"sum", nullptr},               // first argument will be summed
    {"min_max", &min_max_options},  // second argument's extrema will be found
};

std::shared_ptr<arrow::Array> needs_sum = ...;
std::shared_ptr<arrow::Array> needs_min_max = ...;
std::shared_ptr<arrow::Array> key_0 = ...;
std::shared_ptr<arrow::Array> key_1 = ...;

ARROW_ASSIGN_OR_RAISE(arrow::Datum out,
                      arrow::compute::CallFunction("group_by",
                                                   {
                                                       needs_sum,
                                                       needs_min_max,
                                                       key_0,
                                                       key_1,
                                                   },
                                                   &options));

// Unpack struct array result (a four-field array)
auto out_array = out.array_as<StructArray>();
std::shared_ptr<arrow::Array> sums = out_array->field(0);
std::shared_ptr<arrow::Array> mins_and_maxes = out_array->field(1);
std::shared_ptr<arrow::Array> group_key_0 = out_array->field(2);
std::shared_ptr<arrow::Array> group_key_1 = out_array->field(3);
```
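To make the intended semantics concrete, here is a minimal sketch in plain C++ with no Arrow dependency. It is not the implementation in this patch: `GroupResult`, `Key`, and `GroupBySketch` are hypothetical names, the aggregands are assumed to be non-null `int64` values, and the keys are assumed to be `int32`. It only illustrates that each distinct key pair yields one output slot holding the sum of the first aggregand and the extrema of the second.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical per-group state (not an Arrow type).
struct GroupResult {
  int64_t sum = 0;
  int64_t min = 0;
  int64_t max = 0;
  bool seen = false;  // false until the first row of the group is visited
};

// A group is identified by the pair (key_0, key_1).
using Key = std::pair<int32_t, int32_t>;

// Sum the first aggregand and track the extrema of the second, per group.
std::map<Key, GroupResult> GroupBySketch(const std::vector<int64_t>& needs_sum,
                                         const std::vector<int64_t>& needs_min_max,
                                         const std::vector<int32_t>& key_0,
                                         const std::vector<int32_t>& key_1) {
  // All columns must have the same number of rows.
  assert(needs_sum.size() == key_0.size());
  assert(needs_min_max.size() == key_0.size());
  assert(key_1.size() == key_0.size());

  std::map<Key, GroupResult> groups;
  for (std::size_t i = 0; i < key_0.size(); ++i) {
    GroupResult& g = groups[Key(key_0[i], key_1[i])];
    g.sum += needs_sum[i];
    if (!g.seen) {
      g.min = needs_min_max[i];
      g.max = needs_min_max[i];
      g.seen = true;
    } else {
      g.min = std::min(g.min, needs_min_max[i]);
      g.max = std::max(g.max, needs_min_max[i]);
    }
  }
  return groups;
}
```

Each entry of the returned map corresponds to one slot of the struct array above: the map key plays the role of fields 2 and 3, and `sum` and `min`/`max` play the role of fields 0 and 1.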
TODO:
- [ ] Implement more aggregators (only sum, count, and min_max are implemented so far)
- [ ] Add an aggregator which returns a list of row indices of members for
use in partitioned dataset writing
- [ ] Reorganization
- [ ] Comments
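
As a rough illustration of the row-index aggregator mentioned in the TODO list, the following plain C++ sketch collects, for each distinct key, the indices of the rows that belong to that group. `RowIndicesByKey` is a hypothetical helper, not part of this patch or of Arrow, and it assumes a single non-null `int32` key column.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical helper (not an Arrow API): map each distinct key to the
// indices of the rows where it occurs, in ascending row order.
std::map<int32_t, std::vector<int64_t>> RowIndicesByKey(
    const std::vector<int32_t>& keys) {
  std::map<int32_t, std::vector<int64_t>> indices;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    indices[keys[i]].push_back(static_cast<int64_t>(i));
  }
  return indices;
}
```

A partitioned writer could then take the rows listed for each key and write them to that key's partition.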