Re: [PR] feat: Add distinct accumulator for `Perfect Hash Join` [datafusion]

via GitHub Tue, 16 Sep 2025 21:39:50 -0700


adriangb commented on code in PR #17606:
URL: https://github.com/apache/datafusion/pull/17606#discussion_r2354278704



##########
datafusion/functions-aggregate/src/count.rs:
##########
@@ -746,12 +746,25 @@ fn null_count_for_multiple_cols(values: &[ArrayRef]) -> usize {
 /// more efficient such as [`PrimitiveDistinctCountAccumulator`] and
 /// [`BytesDistinctCountAccumulator`]
 #[derive(Debug)]
-struct DistinctCountAccumulator {
+pub struct DistinctCountAccumulator {
     values: HashSet<ScalarValue, RandomState>,

Review Comment:
   I don't know much about "perfect join", but I worry that tracking distinct 
values adds considerable overhead. Do we do it *only* when we know with 
certainty that we will be able to do a "perfect join"? If the cardinality is 
high, or in other pathological cases, is it still always faster to track the 
distinct values?
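
To make the overhead concern concrete, here is a minimal sketch of one possible mitigation: stop tracking once the number of distinct values passes a cap. It assumes `i64` keys, and the names `BoundedDistinctTracker` and `max_distinct` are hypothetical. None of this is taken from the PR or from DataFusion's API, and it may not match how the PR actually decides when to track.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of DataFusion): collects distinct keys
/// until a cap is exceeded, then drops the set so that later batches
/// cost only a branch instead of a hash insert per row.
#[derive(Debug)]
pub struct BoundedDistinctTracker {
    /// `Some(set)` while tracking is active, `None` once abandoned.
    seen: Option<HashSet<i64>>,
    /// Assumed cap on distinct keys; a real threshold would need tuning.
    max_distinct: usize,
}

impl BoundedDistinctTracker {
    pub fn new(max_distinct: usize) -> Self {
        Self {
            seen: Some(HashSet::new()),
            max_distinct,
        }
    }

    /// Feed one batch of keys. Tracking is abandoned as soon as the
    /// number of distinct keys exceeds `max_distinct`.
    pub fn update(&mut self, keys: &[i64]) {
        let mut overflow = false;
        if let Some(set) = self.seen.as_mut() {
            for &key in keys {
                set.insert(key);
                if set.len() > self.max_distinct {
                    overflow = true;
                    break;
                }
            }
        }
        if overflow {
            // Free the memory and make all later updates a no-op.
            self.seen = None;
        }
    }

    /// `Some(n)` if tracking survived, `None` if the cap was exceeded.
    pub fn distinct_count(&self) -> Option<usize> {
        self.seen.as_ref().map(|set| set.len())
    }
}
```

With a cap of 3, feeding `[1, 2, 2, 3]` leaves `distinct_count()` at `Some(3)`, and a later batch containing a fourth distinct key switches it to `None`. This bounds the worst-case cost for high-cardinality inputs, but rows seen before the cap is hit still pay for the hash inserts.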



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

