asolimando commented on issue #18628: URL: https://github.com/apache/datafusion/issues/18628#issuecomment-3532487876
Thinking of aggregation and repartitioning, NDV could also help estimate data skew: the average load per key is `#tuples / NDV(c)`. If this value is higher than a threshold (usually `>> 1` could be a sensible criterion), `round-robin` is better than `hash-repartitioning(c)`, as one key would get too much data. Salting could be used as a refinement, but that’s a simple case to add on top of the others.

You mentioned join ordering, but the choice of join algorithm could also benefit from NDV: if the build side of a hash join has a high NDV, it might be better to go with a nested-loop join or a sort-merge join (depending on the sorting properties of the input operators).
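To make the repartitioning heuristic concrete, here is a minimal Python sketch of the check described above. It is illustrative only: `ColumnStats`, `avg_rows_per_key`, `choose_repartitioning`, and the `threshold` parameter are hypothetical names, not DataFusion APIs, and falling back to hash repartitioning when NDV is unknown is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical statistics container; not a DataFusion type."""
    num_rows: int
    ndv: Optional[int]  # estimated number of distinct values, None if unknown


def avg_rows_per_key(stats: ColumnStats) -> Optional[float]:
    """Average load per key, i.e. #tuples / NDV(c); None when NDV is unusable."""
    if stats.ndv is None or stats.ndv <= 0:
        return None
    return stats.num_rows / stats.ndv


def choose_repartitioning(stats: ColumnStats, threshold: float) -> str:
    """Return "round-robin" when the average load per key exceeds the threshold."""
    load = avg_rows_per_key(stats)
    if load is None:
        # Assumption: without an NDV estimate, keep hash repartitioning.
        return "hash"
    return "round-robin" if load > threshold else "hash"
```

For example, with 1,000,000 rows and an NDV of 10, the average load per key is 100,000 rows, so a threshold of 1,000 would select round-robin; with an NDV close to the row count, the load is about 1 and hash repartitioning is kept. The right threshold would need to be tuned empirically.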
