zanmato1984 commented on code in PR #45789:
URL: https://github.com/apache/arrow/pull/45789#discussion_r2003143566


##########
cpp/src/arrow/acero/groupby_aggregate_node.cc:
##########
@@ -312,7 +312,7 @@ Result<ExecBatch> GroupByNode::Finalize() {
                          segment_key_field_ids_.size());
 
   // Segment keys come first
-  PlaceFields(out_data, 0, segmenter_values_);
+  PlaceFields(out_data, 0, state->segmenter_values);

Review Comment:
   It can be the case, but may not be that useful. Because IMO part of the 
point of segmented aggregation is to emit the result in a streaming fashion, 
that is, as soon as a segment is concluded to be "close", we are sure that the 
current aggregation result is a valid partial result thus can be output to 
downstream. Rechunking the input would require all input batches to be 
accumulated already.
   
   To multi-threading the segmented aggregation, I would imagine the batches to 
be already partitioned by (and sorted by, of course, this is already implied by 
the current single-threaded impl) segment keys and distributed to specific 
threads (# partition == # thread). This can be achieved by a special source 
node or a "shuffle" node.
   
   This way each thread of the aggregate can process all rows belonging to a 
specific segment. This would require some modification to the current aggregate 
node such as not merging other thread states, or just have a brand new 
"partitioned aggregate" node.
   
   Anyway, it's not trivial and can be quite restrictive to use.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to