Rachelint commented on issue #20773:
URL: https://github.com/apache/datafusion/issues/20773#issuecomment-4017282421
My concerns are:
- The paper describes how to keep the hashtable small and
cache-efficient in the high cardinality case.
- The situation in `datafusion` seems to be:
  - In the `high cardinality` case, `datafusion` uses the `skip aggregation`
approach to skip the `partial aggr` entirely, and skipping may be the
better approach (`doris` uses a similar approach).
  - In the `low cardinality` case, the hashtable is usually small enough
without extra processing?
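To illustrate the `high cardinality` point, here is a minimal sketch (in Rust) of a skip-aggregation probe. The names and threshold here are hypothetical, not `datafusion`'s actual config keys; the idea is just that after observing some probe rows, the partial aggregation is skipped when the ratio of distinct groups to input rows is high:

```rust
/// Hypothetical sketch of a skip-partial-aggregation probe: after observing
/// `probe_rows` input rows that produced `unique_groups` distinct groups,
/// skip the partial aggregation when the groups-to-rows ratio exceeds a
/// threshold (names are illustrative, not datafusion's real config keys).
fn should_skip_partial_agg(probe_rows: usize, unique_groups: usize, ratio_threshold: f64) -> bool {
    // High ratio => aggregation reduces keys very little, so the partial
    // pass mostly burns CPU and cache for nothing.
    probe_rows > 0 && (unique_groups as f64 / probe_rows as f64) >= ratio_threshold
}
```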
> The description in DuckDB blog sounds to me very similar/the same as the
paper, "spills when ht becomes full" means there is some threshold on the
hashtable, after which it will repartition/spill to the local hash tables.
The method in [Leis et
al](https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf) may be
closer to current `duckdb` (the approach is mentioned together with their
external aggregation in https://duckdb.org/2024/03/29/external-aggregation),
and it differs from `duckdb 0.7`:
- Each `partial aggr` thread keeps only one small, fixed-size hashtable, and
only the group values are partitioned.
- When the hashtable is full, flush the partitioned group values.
- Clear and reuse the hashtable.
- In `final aggr`, what current `duckdb` does is similar to `datafusion`: it
always performs a partition-wise parallel merge (`duckdb 0.7` performs a
single-threaded merge when groups are few).
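The first three bullets could be sketched roughly like this, as a toy single-threaded version (not duckdb's or the paper's actual implementation; the capacity and partition count are illustrative, chosen tiny so the table would stay cache-resident):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const NUM_PARTITIONS: usize = 4; // radix partitions, illustrative
const HT_CAPACITY: usize = 8;    // tiny fixed cap so the table stays hot in cache

/// Toy sketch of the partial-aggregation loop: one small fixed-capacity
/// hashtable; when it fills up, its contents are flushed into per-partition
/// buffers (partitioned by group hash) and the table is cleared and reused.
struct PartialAggregator {
    ht: HashMap<u64, i64>,            // group key -> running sum
    partitions: Vec<Vec<(u64, i64)>>, // flushed (key, partial sum) pairs
}

impl PartialAggregator {
    fn new() -> Self {
        Self { ht: HashMap::new(), partitions: vec![Vec::new(); NUM_PARTITIONS] }
    }

    fn insert(&mut self, key: u64, value: i64) {
        // A new key would grow a full table: flush partitioned groups first,
        // then reuse the (now empty) table.
        if !self.ht.contains_key(&key) && self.ht.len() >= HT_CAPACITY {
            self.flush();
        }
        *self.ht.entry(key).or_insert(0) += value;
    }

    fn flush(&mut self) {
        for (key, sum) in self.ht.drain() {
            let mut h = DefaultHasher::new();
            key.hash(&mut h);
            let p = (h.finish() as usize) % NUM_PARTITIONS;
            self.partitions[p].push((key, sum));
        }
    }
}
```

Note that a key may appear in several flushed chunks of the same partition, so the `final aggr` still has to merge duplicates; the point is only that the partial-side table never outgrows the cache.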
As I see it, `making partial aggr cache-friendly` as mentioned in the paper
(and in current `duckdb`) is actually `making the hashtable in partial aggr
cache-friendly`, and both the paper and current `duckdb` use the reset +
rebuild method to achieve it even in the `high cardinality` case.
`Duckdb 0.7` ([method in the
blog](https://duckdb.org/2022/03/07/aggregate-hashtable#parallel-aggregation))
seems to only describe how the partitioned hashtable supports parallel merge
in `final aggr`. And a partitioned hashtable actually helps little with
`cache-friendliness`, based on some experiments I did before.
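For the partition-wise parallel merge in `final aggr`, the idea is that because every partial-aggregation thread partitions its flushed groups with the same hash function, partition `i` from all threads can be merged independently. A toy sketch (one OS thread per partition, purely illustrative; real engines would use a task scheduler):

```rust
use std::collections::HashMap;
use std::thread;

/// Toy sketch of a partition-wise parallel merge: `per_thread[t][p]` holds
/// the (key, partial sum) pairs that partial-aggregation thread `t` flushed
/// into partition `p`. Each partition is merged on its own thread.
/// Assumes at least one partial-aggregation thread.
fn final_merge(per_thread: Vec<Vec<Vec<(u64, i64)>>>) -> Vec<HashMap<u64, i64>> {
    let num_partitions = per_thread[0].len();
    let mut handles = Vec::new();
    for p in 0..num_partitions {
        // Gather partition p from every partial-aggregation thread.
        let chunks: Vec<Vec<(u64, i64)>> =
            per_thread.iter().map(|t| t[p].clone()).collect();
        handles.push(thread::spawn(move || {
            let mut merged = HashMap::new();
            for chunk in chunks {
                for (k, v) in chunk {
                    *merged.entry(k).or_insert(0) += v; // merge duplicate keys
                }
            }
            merged
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```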
> The problem with spilling to repartition / final aggregation
(https://github.com/apache/datafusion/issues/6937#issuecomment-1681310199) is
that it will reduce keys less, and thus cause much more cross communication /
repartition overhead and cost in the upstream aggregation.
Agreed that the "flush" is too expensive here, which may lead to no obvious
improvement.
This may not be a good case for measuring whether the method in [Leis et
al](https://15721.courses.cs.cmu.edu/spring2016/papers/p743-leis.pdf) can also
work well in `datafusion`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]