adriangb commented on code in PR #17606:
URL: https://github.com/apache/datafusion/pull/17606#discussion_r2354278704


##########
datafusion/functions-aggregate/src/count.rs:
##########
@@ -746,12 +746,25 @@ fn null_count_for_multiple_cols(values: &[ArrayRef]) -> usize {
 /// more efficient such as [`PrimitiveDistinctCountAccumulator`] and
 /// [`BytesDistinctCountAccumulator`]
 #[derive(Debug)]
-struct DistinctCountAccumulator {
+pub struct DistinctCountAccumulator {
     values: HashSet<ScalarValue, RandomState>,

Review Comment:
   I don't know much about "perfect join", but I worry that tracking distinct 
values adds considerable overhead. Do we do it *only* when we know for certain 
that a "perfect join" will be possible? With high cardinality or other 
pathological inputs, is tracking the distinct values still always faster?
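
   To make the overhead concern concrete, one possible mitigation is to cap 
the number of distinct values tracked and stop tracking once the cap is 
exceeded. The following is only a sketch under stated assumptions: 
`BoundedDistinctTracker`, `observe`, `distinct_count`, and the `limit` 
parameter are hypothetical names, not existing DataFusion APIs and not what 
this PR implements, and it uses a plain `std::collections::HashSet` rather 
than `HashSet<ScalarValue, RandomState>`.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper (not part of DataFusion): tracks distinct values only
/// while the set holds at most `limit` entries, then gives up and frees it.
#[derive(Debug)]
pub struct BoundedDistinctTracker<T> {
    /// `Some(set)` while tracking is active, `None` once the cap was exceeded.
    values: Option<HashSet<T>>,
    limit: usize,
}

impl<T: Eq + Hash> BoundedDistinctTracker<T> {
    pub fn new(limit: usize) -> Self {
        Self {
            values: Some(HashSet::new()),
            limit,
        }
    }

    /// Record one value. Once more than `limit` distinct values have been
    /// seen, the set is dropped and later calls become no-ops.
    pub fn observe(&mut self, value: T) {
        if let Some(set) = self.values.as_mut() {
            set.insert(value);
            if set.len() > self.limit {
                // Cardinality too high: stop paying for the hash set.
                self.values = None;
            }
        }
    }

    /// Number of distinct values seen, or `None` if tracking was abandoned.
    pub fn distinct_count(&self) -> Option<usize> {
        self.values.as_ref().map(|set| set.len())
    }
}
```

   With a cap like this, the worst-case cost is bounded by `limit` insertions 
and `limit` stored values, and a caller would treat `None` as "fall back to 
the regular join path". Whether such a cap fits the design here is an open 
question, not a claim about the PR.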



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org