Re: design for ordered aggregation

Yaron Gvili Wed, 07 Sep 2022 03:16:42 -0700

Thanks, Weston.

Note that a change in the ordered value is only an opportunity to flush, which 
may or may not be skipped. It is an optimization choice which opportunities to 
take, and the full aggregation output will be equivalent (up to batching) for 
any such choice. The design makes this choice configurable via the range 
selection interface - the range can cover one or more ordered values, but is 
not allowed to partially cover any ordered value.



Yaron.
________________________________
From: Weston Pace <weston.p...@gmail.com>
Sent: Tuesday, September 6, 2022 6:02 PM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: design for ordered aggregation

It seems like a reasonable approach.  I think my initial gut feeling
would be that initializing and finalizing state for each change of key
might be a bit heavyweight in cases where there are only a few values
per key.  I think these cases are fairly common as a data
simplification / cleaning pass.  Probably only benchmarks will really
say.

On Tue, Sep 6, 2022 at 10:26 AM Yaron Gvili <rt...@hotmail.com> wrote:
>
> Hi All,
>
> I'm working on a design for ordered aggregations in Arrow C++ and would like 
> to get some opinions about it. Ordered aggregation is similar to grouped 
> aggregation except that one column in the grouping key is (known to be) 
> ordered. The result of both types of aggregations is the same but the 
> existence of an ordered column enables optimizing. In particular, the 
> grouping keys and their aggregation output can be flushed whenever all rows 
> having the same ordered value have been processed.
>
> Main points of the design:
>
>   1.  Create a class OrderedGroupAggregator derived from GroupedAggregator. 
> This class will wrap a factory for an inner OrderedAggregator, passed in via 
> options. The factory allows the inner OrderedAggregator to be recreated for 
> the next ordered value after it is finalized for the current ordered value.
>   2.  Add a callback API that accepts finalized aggregation output. This 
> callback will be implemented by the aggregation node and will stream-forward 
> finalized aggregation output.
>   3.  Add an interface for selecting the next range of rows from the batch to 
> be processed. The range may be open-ended, meaning that more rows in the next 
> batch may be added to the range. In the grouped aggregation case this range 
> would be the entire batch and open-ended, whereas in the ordered aggregation 
> case this range would end at a row where the ordered value changes. After a 
> range has been processed, finalized aggregation output will be flushed.
>
> Cheers,
> Yaron.

Re: design for ordered aggregation

Reply via email to