aditanase commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3184147237
@theirix many thanks for this PR! The `SYSTEM` sampling strategy is really useful for large tables, and I've seen more interesting variants of it in the Ray project, where shuffling input data is a key requirement for training robust models: https://docs.ray.io/en/latest/data/shuffling-data.html

At our company we're heavy users of the Delta table format, and we use `delta-rs` to consume those tables. In that scenario, file pruning (by partition values and statistics) and limits can be handled at the metadata layer, as part of predicate pushdown and limit pushdown. I recently authored a PR that enables LIMIT pushdown for partition predicates on Delta tables: https://github.com/delta-io/delta-rs/pull/3436/commits/15f2ade11c8627173bfca6568c9f3f6f2dd6c619

Have you considered this use case? I'm wondering what we would need to do at this stage to pass enough information into the datasource to turn the `SYSTEM` hint into a metadata operation while keeping the other optimizations alive. In the delta-rs case we could simply randomize the order of active files before pruning them.
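
To make the last suggestion concrete, here is a minimal sketch of "randomize the order of active files, then prune", assuming the table's metadata exposes a per-file row count. `FileEntry`, `SplitMix64`, `shuffle`, and `sample_files` are hypothetical names used only for illustration; none of them are delta-rs or DataFusion APIs, and the small PRNG is there only so the sketch needs no external crates.

```rust
/// Hypothetical stand-in for one active file in the table metadata.
/// This is NOT a delta-rs type; it only carries what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
struct FileEntry {
    path: String,
    num_rows: u64,
}

/// Small deterministic PRNG (SplitMix64), used here instead of an
/// external crate so the example stays self-contained.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Fisher-Yates shuffle. The modulo introduces a tiny bias, which is
/// acceptable for a sketch but worth revisiting in real code.
fn shuffle<T>(items: &mut [T], rng: &mut SplitMix64) {
    for i in (1..items.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

/// Shuffle the active files, then keep whole files until roughly
/// `fraction` of the table's rows is covered. Only metadata is read;
/// no data file is opened.
fn sample_files(mut files: Vec<FileEntry>, fraction: f64, seed: u64) -> Vec<FileEntry> {
    let total_rows: u64 = files.iter().map(|f| f.num_rows).sum();
    let target = (total_rows as f64 * fraction.clamp(0.0, 1.0)).ceil() as u64;

    let mut rng = SplitMix64(seed);
    shuffle(&mut files, &mut rng);

    let mut picked = Vec::new();
    let mut rows = 0u64;
    for file in files {
        if rows >= target {
            break;
        }
        rows += file.num_rows;
        picked.push(file);
    }
    picked
}
```

Calling `sample_files(active_files, 0.25, seed)` would return a random subset of files covering about a quarter of the rows. Because whole files are kept or dropped, this is block-level sampling in the spirit of `SYSTEM`, and partition or statistics pruning could still run on the list before or after the shuffle.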