aditanase commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3184147237

   @theirix many thanks for this PR! The `SYSTEM` sampling strategy is really 
useful for large tables, and I've seen more interesting variants of it in the 
Ray project, where shuffling input data is a key requirement for training 
robust models: 
   https://docs.ray.io/en/latest/data/shuffling-data.html
   
   At our company we're heavy users of the Delta table format, and we use 
`delta-rs` to consume these tables.
   In that scenario, the metadata layer can prune files (using partitions and 
statistics) as part of predicate pushdown, and cut the file list short as part 
of limit pushdown.
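
To make the metadata-level pruning concrete, here is a minimal sketch in Python. It is not delta-rs code: `FileEntry`, `plan_scan`, and the `min_ts`/`max_ts` statistics are hypothetical names chosen for illustration, and the sketch assumes per-file row counts are available in the table metadata. It also assumes the early stop for the limit is only safe when the predicate is on partition columns alone, since only then does every row of a kept file match.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """Hypothetical per-file metadata, loosely modelled on a table log entry."""
    path: str
    partition: dict      # partition column -> value
    num_records: int     # row count taken from file statistics
    min_ts: int          # min/max statistics for an illustrative `ts` column
    max_ts: int


def plan_scan(files, partition_filter, min_ts=None, limit=None):
    """Choose the files to scan using metadata only.

    partition_filter: equality predicates on partition columns.
    min_ts: optional lower bound on `ts`, checked against file statistics.
    limit: optional row limit.
    """
    # Stopping early is only exact when every row of a kept file matches,
    # which holds for partition-only predicates (assumption of this sketch).
    can_stop_early = limit is not None and min_ts is None
    selected, rows = [], 0
    for f in files:
        # Partition pruning: skip files whose partition values do not match.
        if any(f.partition.get(k) != v for k, v in partition_filter.items()):
            continue
        # Statistics pruning: skip files that cannot contain a matching row.
        if min_ts is not None and f.max_ts < min_ts:
            continue
        selected.append(f)
        rows += f.num_records
        # Limit pushdown: stop once the kept files cover the limit.
        if can_stop_early and rows >= limit:
            break
    return selected
```

With a partition-only filter and a limit, the loop stops as soon as the kept files hold enough rows. Once a statistics predicate is involved, it falls back to returning every file that survives pruning.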
   
   I've recently authored a PR that enables LIMIT pushdown for partition 
predicates on Delta tables:
   
https://github.com/delta-io/delta-rs/pull/3436/commits/15f2ade11c8627173bfca6568c9f3f6f2dd6c619
   
   Have you considered this use case? I'm wondering what we would need to do at 
this stage to pass enough information to the data source to turn the SYSTEM 
hint into a metadata operation while keeping other optimizations alive. In the 
delta-rs case, we could simply randomize the order of active files before 
pruning them.
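
As a rough illustration of that last idea, the sketch below shuffles the file list and then keeps files until a target fraction of rows is covered. Everything here is an assumption for the sake of the example: `FileEntry` and `sample_files` are hypothetical names, not delta-rs or DataFusion APIs, and per-file row counts are assumed to be available from metadata. Note that this takes a prefix of a shuffled list rather than including each file independently with probability `fraction`, so it only approximates block-level `SYSTEM` sampling.

```python
import random
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Hypothetical per-file metadata."""
    path: str
    num_records: int
    partition: dict = field(default_factory=dict)


def sample_files(files, fraction, partition_filter=None, seed=None):
    """Pick a random subset of files covering about `fraction` of the rows.

    The file list is shuffled first, then pruned by partition values, then
    cut short once enough rows are covered, so pruning and the early stop
    both still apply after randomization.
    """
    rng = random.Random(seed)      # a fixed seed makes the sample repeatable
    shuffled = list(files)         # copy so the caller's list is untouched
    rng.shuffle(shuffled)

    partition_filter = partition_filter or {}
    matching = [
        f for f in shuffled
        if all(f.partition.get(k) == v for k, v in partition_filter.items())
    ]

    target = fraction * sum(f.num_records for f in matching)
    picked, rows = [], 0
    for f in matching:
        if rows >= target:
            break                  # enough rows covered, skip remaining files
        picked.append(f)
        rows += f.num_records
    return picked
```

Because whole files are kept or dropped, the sampled row count can overshoot the target by up to one file, which is the usual trade-off of block-level sampling.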


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

