ShashidharM0118 opened a new pull request, #18763:
URL: https://github.com/apache/datafusion/pull/18763
## Which issue does this PR close?
- Closes #18670
## Rationale for this change
As described in #18670, `GenericDistinctBuffer` previously forced all types
through the `Hashable<T>` wrapper, even for natively hashable types like
integers. This was necessary to handle floats (which don't implement `Hash`
natively), but it meant that types with efficient native `Hash` implementations
(like `i32`, `u64`, etc.) were unnecessarily wrapped, potentially impacting
performance.
This PR introduces a trait-based solution that allows
`GenericDistinctBuffer` to work generically with both native hashable types and
types requiring the `Hashable` wrapper, without code duplication.
## What changes are included in this PR?
1. **Introduced `DistinctStorage` trait**: A new trait that abstracts over
different storage strategies for distinct value tracking. It provides two
implementations:
- `HashSet<T, RandomState>` for natively hashable types (integers,
decimals, dates, timestamps)
- `HashSet<Hashable<T>, RandomState>` for types requiring special hashing
(floats)
2. **Made `GenericDistinctBuffer` generic over storage**: Changed from
`GenericDistinctBuffer<T>` to `GenericDistinctBuffer<T, S>` where `S:
DistinctStorage<Native = T::Native>`. This allows the buffer to use the
appropriate storage strategy based on the type.
3. **Unified distinct count accumulators**: Merged
`PrimitiveDistinctCountAccumulator` and `FloatDistinctCountAccumulator` into a
single generic implementation `PrimitiveDistinctCountAccumulator<T, S>`. The
former `FloatDistinctCountAccumulator` is now a type alias using `Hashable`
storage.
4. **Updated all usages**: Modified all call sites in `count.rs`,
`sum_distinct`, `median`, and `percentile_cont` to specify the appropriate
storage type (native `HashSet<T>` for integers/dates/timestamps,
`HashSet<Hashable<T>>` for floats).
5. **Added `Send + Sync` bounds**: Added necessary trait bounds for proper
thread-safe trait implementations in DataFusion's parallel execution
environment.
## Are these changes tested?
Yes
## Are there any user-facing changes?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]