Re: [I] DataFusion not using NDV stat [datafusion]

via GitHub Fri, 14 Nov 2025 04:24:47 -0800


asolimando commented on issue #18628:
URL: https://github.com/apache/datafusion/issues/18628#issuecomment-3532487876


   For aggregation and repartitioning, NDV could also help estimate data 
skew: the average load per key is `#tuples / NDV(c)`. If this value exceeds a 
threshold (`>> 1` could be a sensible criterion), `round-robin` is preferable 
to `hash-repartitioning(c)`, since a single key could receive too much data. 
Salting could be used as a refinement, but that's a simple case to add on top 
of the others.
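
   A minimal sketch of that heuristic, in Python for brevity. The function 
names and the default threshold are hypothetical and chosen only for 
illustration; this is not DataFusion's API or its current behavior.

```python
def avg_rows_per_key(num_rows: int, ndv: int) -> float:
    """Average load per key: #tuples / NDV(c)."""
    if ndv <= 0:
        raise ValueError("NDV must be positive")
    return num_rows / ndv


def choose_repartitioning(num_rows: int, ndv: int, threshold: float = 1000.0) -> str:
    """Pick a repartitioning strategy from the average load per key.

    `threshold` is an assumed tuning knob standing in for the ">> 1"
    criterion; a real planner would need to calibrate it.
    """
    if avg_rows_per_key(num_rows, ndv) > threshold:
        # Each key carries a lot of rows, so hashing on the column risks
        # overloading the partition that receives a given key.
        return "round-robin"
    return "hash"
```

   For example, 1,000,000 rows with 10 distinct keys gives 100,000 rows per 
key and selects `round-robin`, while 500,000 distinct keys gives 2 rows per 
key and keeps `hash`.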
   
   You mentioned join ordering, but the choice of join algorithm could also 
benefit from NDV: if the build side of a hash join has a high NDV, it might be 
better to use a nested-loop join or a sort-merge join (depending on the 
sort properties of the input operators).
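
   The rule of thumb above could be sketched as follows, again in Python. 
The function name, the `ndv_threshold` parameter, and the 
`inputs_sorted_on_key` flag are assumptions made for illustration; the sketch 
mirrors the comment literally and does not describe how DataFusion selects 
joins.

```python
def choose_join_algorithm(
    build_ndv: int,
    ndv_threshold: int,
    inputs_sorted_on_key: bool,
) -> str:
    """Pick a join algorithm from the build side's NDV on the join key.

    `ndv_threshold` is a hypothetical cutoff for "high NDV";
    `inputs_sorted_on_key` says whether both inputs already arrive
    sorted on the join key.
    """
    if build_ndv <= ndv_threshold:
        # Few distinct keys on the build side: keep the hash join.
        return "hash"
    # High NDV on the build side: fall back to an alternative, using the
    # existing sort order to decide which one.
    return "sort-merge" if inputs_sorted_on_key else "nested-loop"
```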


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


