alamb commented on issue #17446:
URL: https://github.com/apache/datafusion/issues/17446#issuecomment-3347939002

   > i tried to disable partial aggregation
   > 
   > ```
   > SET datafusion.execution.skip_partial_aggregation_probe_ratio_threshold=0;
   > SET datafusion.execution.skip_partial_aggregation_probe_rows_threshold=0;
   > ```
   > 
   > The performance reduced down to 2s, but still significantly slower than 
polar, i'll need to take a closer look on how their internal execution plan 
looks like
   
   Thanks for this investigation @duongcongtoai 
   
   > I also notice that AccumulatorState::size takes up to 30% of CPU
   
   I think this is an artifact that  function of using the (old, per row) 
[`Accumulator`](https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html)
 trait is quite slow compared to the more optimal  (but harder to implement) 
[`GroupsAccumulator`](https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html)
 trait
   
   There is more background information on this topic here: 
https://datafusion.apache.org/blog/2023/08/05/datafusion_fast_grouping/
   
   So I think to be competitive with Polars we will need to implement a 
`GroupsAccumulator` for array_agg
   
   I actually think this might be straight forward since array_agg isn't really 
aggregating anything
   
   Specifically, I think you might be able to create something relatively 
simple for the state
   * Save the group assignments and batches as they come in (array_agg needs to 
store all the rows)
   * On emit, form the correct outputs by copying to the output
   
   
   - That might also fix https://github.com/apache/datafusion/issues/16517
   
   Note there is a previous attempt at a GroupsAccumulator in  this PR from @lkt
   - https://github.com/apache/datafusion/pull/10149
   
   FYI @gabotechs as you have previously shown interest in improving 
array_agg's memory profile


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to