rtpsw commented on code in PR #14352: URL: https://github.com/apache/arrow/pull/14352#discussion_r1019644565
##########
cpp/src/arrow/compute/exec/options.h:
##########
@@ -106,21 +106,32 @@ class ARROW_EXPORT ProjectNodeOptions : public ExecNodeOptions {
   std::vector<std::string> names;
 };
-/// \brief Make a node which aggregates input batches, optionally grouped by keys.
+/// \brief Make a node which aggregates input batches, optionally grouped by keys and
+/// optionally segmented by segment-keys. Both keys and segment-keys determine the group.
+/// However segment-keys are also used for determining grouping segments, which should be
+/// large, and allow streaming a partial aggregation result after processing each segment.

Review Comment:
The definition is: when moving from one row to the next, if the tuple of segment keys stays the same, then both rows are in the same segment, whereas if the tuple changes in any way (not necessarily in lexicographic order), then they are in different segments. See also [this post](https://github.com/apache/arrow/pull/14352#issuecomment-1272400060).

There are use cases for any segment size. With smaller segments there is more group-by processing overhead, at least because more batches are processed and because the states are reinitialized on each segment crossing. In exchange, the resulting stream generates more batches, and does so with lower latency.
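
To make the definition above concrete, here is a minimal standalone sketch of splitting rows into segments by comparing each row's segment-key tuple with the previous row's. It does not use Arrow and is not how the PR implements segmentation; `SegmentKey` and `FindSegments` are hypothetical names chosen for illustration, and the two-column integer key is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for the tuple of segment-key values of one row.
using SegmentKey = std::tuple<std::int64_t, std::int64_t>;

// Returns half-open [begin, end) row ranges, one per segment.
// A new segment starts whenever the key tuple differs from the previous
// row's tuple in any way; no ordering of the keys is assumed.
std::vector<std::pair<std::size_t, std::size_t>> FindSegments(
    const std::vector<SegmentKey>& keys) {
  std::vector<std::pair<std::size_t, std::size_t>> segments;
  if (keys.empty()) return segments;

  std::size_t begin = 0;
  for (std::size_t i = 1; i < keys.size(); ++i) {
    if (keys[i] != keys[i - 1]) {
      // Segment crossing: close the current segment and open a new one.
      segments.emplace_back(begin, i);
      begin = i;
    }
  }
  // Close the last segment.
  segments.emplace_back(begin, keys.size());
  return segments;
}
```

Note that under this definition a key tuple that reappears after a different tuple opens a new segment rather than rejoining the earlier one. For the keys (1,1), (1,1), (2,0), (1,1), the sketch yields three segments, which is where segments differ from groups.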