zanmato1984 commented on code in PR #45789:
URL: https://github.com/apache/arrow/pull/45789#discussion_r2003143566
##########
cpp/src/arrow/acero/groupby_aggregate_node.cc:
##########
@@ -312,7 +312,7 @@ Result<ExecBatch> GroupByNode::Finalize() {
segment_key_field_ids_.size());
// Segment keys come first
- PlaceFields(out_data, 0, segmenter_values_);
+ PlaceFields(out_data, 0, state->segmenter_values);
Review Comment:
It can be the case, but may not be that useful. Because IMO part of the
point of segmented aggregation is to emit the result in a streaming fashion,
that is, as soon as a segment is concluded to be "close", we are sure that the
current aggregation result is a valid partial result thus can be output to
downstream. Rechunking the input would require all input batches to be
accumulated already.
To multi-threading the segmented aggregation, I would imagine the batches to
be already partitioned by (and sorted by, of course, this is already implied by
the current single-threaded impl) segment keys and distributed to specific
threads (# partition == # thread). This can be achieved by a special source
node or a "shuffle" node.
This way each thread of the aggregate can process all rows belonging to a
specific segment. This would require some modification to the current aggregate
node such as not merging other thread states, or just have a brand new
"partitioned aggregate" node.
Anyway, it's not trivial and can be quite restrictive to use.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]