Hi All,

I'm working on a design for ordered aggregations in Arrow C++ and would like to 
get some opinions about it. Ordered aggregation is similar to grouped 
aggregation except that one column in the grouping key is (known to be) 
ordered. The result of both types of aggregations is the same but the existence 
of an ordered column enables optimizing. In particular, the grouping keys and 
their aggregation output can be flushed whenever all rows having the same 
ordered value have been processed.

Main points of the design:

  1.  Create a class OrderedGroupAggregator derived from GroupedAggregator. 
This class will wrap a factory for an inner OrderedAggregator, passed in via 
options. The factory allows the inner OrderedAggregator to be recreated for the 
next ordered value after it is finalized for the current ordered value.
  2.  Add a callback API that accepts finalized aggregation output. This 
callback will be implemented by the aggregation node and will stream-forward 
finalized aggregation output.
  3.  Add an interface for selecting the next range of rows from the batch to 
be processed. The range may be open-ended, meaning that more rows in the next 
batch may be added to the range. In the grouped aggregation case this range 
would be the entire batch and open-ended, whereas in the ordered aggregation 
case this range would end at a row where the ordered value changes. After a 
range has been processed, finalized aggregation output will be flushed.

Cheers,
Yaron.

Reply via email to