asolimando commented on PR #22144:
URL: https://github.com/apache/datafusion/pull/22144#issuecomment-4465650148

   @adriangb do you foresee a moment where this is not gated by a config knob?
   
   I have in mind what happens between the partial and final aggregation 
phases, where the partial phase bails out after 100k rows if the NDV is too 
high (that is, the number of groups is proportional to the size of the input).
   
   Runtime adaptivity is a real need and often a good idea, but in some 
pathological cases it can produce false positives: by chance all your distinct 
values are at the beginning, or the selectivity for the first x row 
groups/files is low, and in the end that doesn't reflect the real distribution. 
In those cases you'd rather trust global statistics from files/catalog, 
possibly confirming them with a feedback loop from runtime stats.
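
   To make the false-positive case concrete, here is a minimal sketch of a 
prefix-based probe. It is not DataFusion's implementation: the function names, 
`probe_rows`, and `ratio_threshold` are hypothetical, and the input is a toy 
list of group keys. It only shows how a decision taken on the first rows can 
disagree with the global NDV ratio.

```python
from typing import Hashable, Iterable, Sequence


def probe_says_skip(
    keys: Iterable[Hashable], probe_rows: int, ratio_threshold: float
) -> bool:
    """Hypothetical probe: inspect only the first `probe_rows` keys and report
    whether they look too distinct for partial aggregation to pay off."""
    seen = set()
    rows = 0
    for key in keys:
        seen.add(key)
        rows += 1
        if rows >= probe_rows:
            break
    if rows < probe_rows:
        # Not enough rows observed to make a decision; keep aggregating.
        return False
    return len(seen) / rows >= ratio_threshold


def global_ndv_ratio(keys: Sequence[Hashable]) -> float:
    """Distinct keys divided by total rows, computed over the whole input."""
    return len(set(keys)) / len(keys)


# Pathological input (made up for illustration): 1,000 distinct keys up front,
# followed by 99,000 rows that cycle over only 10 keys.
skewed = list(range(1_000)) + [i % 10 for i in range(99_000)]

# The probe only sees the all-distinct prefix, so it votes to skip.
skip = probe_says_skip(skewed, probe_rows=1_000, ratio_threshold=0.8)

# Over the full input, only about 1% of rows introduce a new group.
ratio = global_ndv_ratio(skewed)
```

   Here `skip` is true even though `ratio` is tiny, which is the situation 
where global statistics would have given the better answer.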
   
   It would be great to gate these runtime adaptivity techniques or at least 
expose the thresholds programmatically to adjust their sensitivity based on 
stats for your workload.
   
   WDYT?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

