[PR] Use NDV estimate to pre-allocate hash tables during aggregation [datafusion]

via GitHub Wed, 15 Apr 2026 12:43:04 -0700


Dandandan opened a new pull request, #21654:
URL: https://github.com/apache/datafusion/pull/21654


   ## Which issue does this PR close?
   
   - Performance improvement, no specific issue.
   
   ## Rationale for this change
   
   During hash aggregation, the hash table in `GroupValues` starts with a small 
or zero initial capacity and grows dynamically as new groups are discovered. 
Each resize requires rehashing all existing entries, which is expensive for 
high-cardinality group-by queries.
   
   When column statistics include `distinct_count` (e.g. from Parquet 
metadata), we can estimate the number of groups upfront and pre-allocate the 
hash table to avoid repeated rehashing.
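
The cost of growing a table versus pre-sizing it can be illustrated with a minimal sketch. It uses the standard library's `HashMap` rather than the `HashTable` used inside DataFusion, and `count_resizes` is a hypothetical helper written only for this illustration. A change in capacity is treated as a proxy for a resize-and-rehash event.

```rust
use std::collections::HashMap;

/// Hypothetical helper: inserts `n` distinct keys and counts how often the
/// map's capacity changes, as a proxy for resize-and-rehash events.
fn count_resizes(mut map: HashMap<u64, usize>, n: u64) -> usize {
    let mut resizes = 0;
    let mut last_capacity = map.capacity();
    for key in 0..n {
        // Each new key gets the next group index, like a group-by would assign.
        let next_index = map.len();
        map.entry(key).or_insert(next_index);
        if map.capacity() != last_capacity {
            resizes += 1;
            last_capacity = map.capacity();
        }
    }
    resizes
}

fn main() {
    let n: u64 = 100_000;
    // No hint: the table starts empty and grows as groups are discovered.
    let grown = count_resizes(HashMap::new(), n);
    // With a hint equal to the number of distinct keys: no growth is needed.
    let presized = count_resizes(HashMap::with_capacity(n as usize), n);
    println!("resizes without hint: {grown}, with hint: {presized}");
}
```

`HashMap::with_capacity(n)` guarantees room for at least `n` entries without reallocating, so the pre-sized run reports zero resizes, while the unhinted run resizes several times on the way to `n` entries.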
   
   ## What changes are included in this PR?
   
   - In `GroupedHashAggregateStream::new()`, compute the NDV (number of 
distinct values) estimate from child statistics using 
`AggregateExec::compute_group_ndv()`, capped at 128K entries
   - Pass this capacity hint through `new_group_values()` to all `GroupValues` 
implementations:
     - `GroupValuesPrimitive` - pre-sizes `HashTable` and values `Vec`
     - `GroupValuesColumn` - pre-sizes `HashTable`
     - `GroupValuesRows` - pre-sizes `HashTable` and row buffer
     - `GroupValuesBytes` / `GroupValuesBytesView` - pre-sizes underlying 
`ArrowBytesMap` / `ArrowBytesViewMap`
   - Add `with_capacity()` constructors to `ArrowBytesMap` and 
`ArrowBytesViewMap`
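
The flow described in the list above could look roughly like the following sketch. This is not the PR's code. `capacity_hint`, `MAX_PREALLOC`, and `ToyGroupValues` are hypothetical names, `std::collections::HashMap` stands in for the real hash table, and "128K" is assumed here to mean 128 * 1024.

```rust
use std::collections::HashMap;

/// Assumed cap on pre-allocation; the PR text says "128K entries",
/// interpreted here as 128 * 1024.
const MAX_PREALLOC: usize = 128 * 1024;

/// Hypothetical helper: turn an optional NDV estimate into a capacity hint.
/// With no statistics available, fall back to 0 (grow on demand).
fn capacity_hint(ndv_estimate: Option<usize>) -> usize {
    ndv_estimate.map_or(0, |ndv| ndv.min(MAX_PREALLOC))
}

/// Hypothetical stand-in for a primitive group-values store: maps each
/// distinct key to a group index and keeps the keys in discovery order.
struct ToyGroupValues {
    map: HashMap<i64, usize>,
    values: Vec<i64>,
}

impl ToyGroupValues {
    /// Pre-sizes both the hash map and the values vector.
    fn with_capacity(capacity: usize) -> Self {
        Self {
            map: HashMap::with_capacity(capacity),
            values: Vec::with_capacity(capacity),
        }
    }

    /// Returns the group index for `key`, creating a new group if needed.
    fn intern(&mut self, key: i64) -> usize {
        let next = self.values.len();
        let values = &mut self.values;
        *self.map.entry(key).or_insert_with(|| {
            values.push(key);
            next
        })
    }
}

fn main() {
    // Pretend the child statistics reported roughly 1,000 distinct values.
    let mut groups = ToyGroupValues::with_capacity(capacity_hint(Some(1_000)));
    for key in [7, 3, 7, 9] {
        groups.intern(key);
    }
    println!("{} groups", groups.values.len());
}
```

Capping the hint keeps a badly overestimated NDV from causing a large up-front allocation. If there are more groups than the cap, the table still grows as it did before.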
   
   ## Are these changes tested?
   
   Covered by existing aggregation tests. The change is transparent - it only 
affects initial allocation sizes, not correctness.
   
   ## Are there any user-facing changes?
   
   No user-facing API changes. Aggregation queries may use slightly more 
initial memory but avoid rehashing overhead, improving performance for queries 
where NDV statistics are available.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

