[GitHub] [arrow] rtpsw commented on a diff in pull request #14352: ARROW-17642: [C++] Add ordered aggregation

GitBox Thu, 10 Nov 2022 13:51:17 -0800


rtpsw commented on code in PR #14352:
URL: https://github.com/apache/arrow/pull/14352#discussion_r1019644565



##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -106,21 +106,32 @@ class ARROW_EXPORT ProjectNodeOptions : public ExecNodeOptions {
   std::vector<std::string> names;
 };
 
-/// \brief Make a node which aggregates input batches, optionally grouped by keys.
+/// \brief Make a node which aggregates input batches, optionally grouped by keys and
+/// optionally segmented by segment-keys. Both keys and segment-keys determine the group.
+/// However segment-keys are also used for determining grouping segments, which should be
+/// large, and allow streaming a partial aggregation result after processing each segment.

Review Comment:
   The definition is: when moving from one row to the next, if the tuple of 
segment keys is constant then both rows are in the same segment, whereas if the 
tuple changes in any way (not necessarily in lexicographic order) then they are 
in different segments - see also [this 
post](https://github.com/apache/arrow/pull/14352#issuecomment-1272400060).
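
   The definition above can be illustrated with a minimal sketch. It assumes 
each row's segment-key tuple is modeled as a `std::vector<int>`; `SegmentKey` 
and `AssignSegments` are hypothetical names used only for illustration and are 
not part of Arrow's API or of this PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model (not Arrow API): one segment-key tuple per row.
using SegmentKey = std::vector<int>;

// Returns a segment id for every row. A new segment starts whenever the
// segment-key tuple differs from the previous row's tuple. Only equality is
// checked, so the keys do not need to be sorted in any order.
std::vector<std::size_t> AssignSegments(const std::vector<SegmentKey>& keys) {
  std::vector<std::size_t> ids;
  ids.reserve(keys.size());
  std::size_t current = 0;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    // Any change in the tuple, in any component, crosses a segment boundary.
    if (i > 0 && keys[i] != keys[i - 1]) ++current;
    ids.push_back(current);
  }
  return ids;
}
```

   For the tuples (1,0), (1,0), (2,0), (1,0), (1,1) this sketch yields segment 
ids 0, 0, 1, 2, 3: returning to (1,0) after (2,0) opens a new segment rather 
than rejoining the first one.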
   
   There are use cases for any segment size. Smaller segments incur more 
group-by processing overhead, at least because more batches are processed and 
the aggregation states are reinitialized at each segment crossing; in return, 
the resulting stream emits more batches, with lower latency.



