adriangb commented on code in PR #17606: URL: https://github.com/apache/datafusion/pull/17606#discussion_r2354278704
##########
datafusion/functions-aggregate/src/count.rs:
##########
@@ -746,12 +746,25 @@ fn null_count_for_multiple_cols(values: &[ArrayRef]) -> usize {
 /// more efficient such as [`PrimitiveDistinctCountAccumulator`] and
 /// [`BytesDistinctCountAccumulator`]
 #[derive(Debug)]
-struct DistinctCountAccumulator {
+pub struct DistinctCountAccumulator {
     values: HashSet<ScalarValue, RandomState>,

Review Comment:
   I don't know much about "perfect join", but I worry that tracking distinct values adds considerable overhead. Do we do it *only* when we know for certain that we can do a "perfect join"? With high cardinality or in other pathological cases, is it still always faster to track the distinct values?
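   To make the concern concrete, here is a minimal sketch of one way the overhead could be bounded: track distinct values only up to a cap, and drop the set once the cap is exceeded. This is not code from the PR or from DataFusion. `CappedDistinct`, `cap`, and the use of plain `i64` keys instead of `ScalarValue` are all hypothetical, chosen to keep the example self-contained.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of DataFusion): tracks distinct keys
/// until more than `cap` have been seen, then stops tracking.
#[derive(Debug)]
struct CappedDistinct {
    /// `None` once tracking has been abandoned.
    seen: Option<HashSet<i64>>,
    /// Maximum number of distinct keys we are willing to hold.
    cap: usize,
}

impl CappedDistinct {
    fn new(cap: usize) -> Self {
        Self { seen: Some(HashSet::new()), cap }
    }

    /// Feed one batch of keys. After abandonment this is a no-op,
    /// so high-cardinality inputs stop paying the hashing cost.
    fn update(&mut self, batch: &[i64]) {
        let cap = self.cap;
        let mut exceeded = false;
        if let Some(seen) = self.seen.as_mut() {
            for &v in batch {
                seen.insert(v);
                if seen.len() > cap {
                    exceeded = true;
                    break;
                }
            }
        }
        if exceeded {
            // Free the set; memory use never grows past `cap + 1` keys.
            self.seen = None;
        }
    }

    /// Number of distinct keys, or `None` if the cap was exceeded.
    fn distinct_count(&self) -> Option<usize> {
        self.seen.as_ref().map(|s| s.len())
    }
}
```

   With something like this, a caller would only consider the specialized join path when `distinct_count()` returns `Some(_)`. Whether a cap of this kind fits the PR's design is an open question, not a claim about how it currently works.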