asolimando commented on PR #22144: URL: https://github.com/apache/datafusion/pull/22144#issuecomment-4465650148
@adriangb do you foresee a point where this is no longer gated by a config knob? I have in mind what happens between the partial and final aggregation phases, where the partial phase bails out after 100k rows if the NDV is too high (that is, when the number of groups is proportional to the size of the input).

Runtime adaptivity is a real need and often a good idea, but in some pathological cases it can produce false positives: by chance all your distinct values are at the beginning, or your selectivity for the first x row groups/files is low, and neither observation reflects the real distribution. In those cases you'd rather trust global statistics from files/catalog, possibly confirming them with a feedback loop from runtime stats.

It would be great to gate these runtime adaptivity techniques, or at least to expose the thresholds programmatically so their sensitivity can be adjusted based on stats for your workload. WDYT?
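To make the false-positive case concrete, here is a minimal sketch in plain Rust. It is not DataFusion's implementation: the names `probe_says_skip`, `global_ndv_ratio`, and `front_loaded_keys`, as well as the probe sizes and the 0.8 ratio used in the note below, are assumptions chosen only for illustration. It models a probe that looks at a prefix of the input and decides to skip partial aggregation when the distinct-to-rows ratio of that prefix is high.

```rust
use std::collections::HashSet;

/// Hypothetical probe: inspect only the first `probe_rows` group keys and
/// report whether their distinct-to-rows ratio reaches `ratio_threshold`.
fn probe_says_skip(keys: &[u64], probe_rows: usize, ratio_threshold: f64) -> bool {
    let prefix = &keys[..probe_rows.min(keys.len())];
    if prefix.is_empty() {
        return false;
    }
    let distinct: HashSet<u64> = prefix.iter().copied().collect();
    distinct.len() as f64 / prefix.len() as f64 >= ratio_threshold
}

/// Distinct-to-rows ratio over the whole input, standing in for what
/// global statistics from files or a catalog would report.
fn global_ndv_ratio(keys: &[u64]) -> f64 {
    if keys.is_empty() {
        return 0.0;
    }
    let distinct: HashSet<u64> = keys.iter().copied().collect();
    distinct.len() as f64 / keys.len() as f64
}

/// Pathological input: each of the `ndv` distinct keys appears once at the
/// front, after which the same keys simply repeat.
fn front_loaded_keys(ndv: u64, total_rows: usize) -> Vec<u64> {
    (0..total_rows as u64).map(|i| i % ndv).collect()
}
```

With `front_loaded_keys(1_000, 100_000)`, a probe over the first 1,000 rows sees a ratio of 1.0 and would bail out at a 0.8 threshold, although the global ratio is 0.01. A probe over 10,000 rows sees 0.1 and would not. That gap is why exposing the thresholds, or cross-checking against global statistics, seems useful.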
