Dandandan opened a new pull request, #21654:
URL: https://github.com/apache/datafusion/pull/21654
## Which issue does this PR close?
- Performance improvement, no specific issue.
## Rationale for this change
During hash aggregation, the hash table in `GroupValues` starts with a small
or zero initial capacity and grows dynamically as new groups are discovered.
Each resize requires rehashing all existing entries, which is expensive for
high-cardinality group-by queries.
When column statistics include `distinct_count` (e.g. from Parquet
metadata), we can estimate the number of groups upfront and pre-allocate the
hash table to avoid repeated rehashing.
## What changes are included in this PR?
- In `GroupedHashAggregateStream::new()`, compute the NDV (number of
distinct values) estimate from child statistics using
`AggregateExec::compute_group_ndv()`, bounded by 128K entries
- Pass this capacity hint through `new_group_values()` to all `GroupValues`
implementations:
- `GroupValuesPrimitive` - pre-sizes `HashTable` and values `Vec`
- `GroupValuesColumn` - pre-sizes `HashTable`
- `GroupValuesRows` - pre-sizes `HashTable` and row buffer
- `GroupValuesBytes` / `GroupValuesBytesView` - pre-sizes underlying
`ArrowBytesMap` / `ArrowBytesViewMap`
- Add `with_capacity()` constructors to `ArrowBytesMap` and
`ArrowBytesViewMap`
## Are these changes tested?
Covered by existing aggregation tests. The change is transparent: it affects
only initial allocation sizes, not correctness.
## Are there any user-facing changes?
No user-facing API changes. Aggregation queries may use slightly more
initial memory but avoid rehashing overhead, improving performance for queries
where NDV statistics are available.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]