Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

via GitHub Wed, 13 Aug 2025 07:23:35 -0700


aditanase commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-3184147237

@theirix many thanks for this PR! The `SYSTEM` sampling strategy is really
useful for large tables, and I've seen more interesting variants of it in the
Ray project, where shuffling input data is a key requirement for training
robust models:
https://docs.ray.io/en/latest/data/shuffling-data.html

At our company we're heavy users of the Delta table format, and we use
`delta-rs` to consume those tables.
In that scenario, predicate pushdown and limit pushdown can be resolved in the
metadata layer: files are pruned using partition values and statistics before
any data is read.

I've recently authored a PR that enables LIMIT pushdown for partition
predicates for delta tables:

https://github.com/delta-io/delta-rs/pull/3436/commits/15f2ade11c8627173bfca6568c9f3f6f2dd6c619

Have you considered this use case? I'm wondering what we would need to do at
this stage to pass enough information to the data source to turn the SYSTEM
hint into a metadata operation while keeping the other optimizations alive. In
the delta-rs case we could simply randomize the order of active files before
pruning them.
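
To make that last idea concrete, here is a rough sketch in Python. It is not
delta-rs or DataFusion code: `FileEntry` and `pick_files` are made-up names,
and a real table snapshot carries much richer statistics. It only shows the
ordering of the steps: shuffle the active files first, then apply the
partition predicate, then stop as soon as the file-level row counts cover the
limit.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    # Hypothetical stand-in for one active file in a table snapshot.
    path: str
    partition: str
    num_records: int  # row count taken from file-level statistics


def pick_files(files, partition, limit, seed=None):
    """Shuffle active files, prune by partition, stop once `limit` rows are covered."""
    rng = random.Random(seed)
    shuffled = list(files)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)   # randomize order before any pruning

    picked, rows = [], 0
    for f in shuffled:
        if f.partition != partition:
            continue        # partition pruning, metadata only
        if rows >= limit:
            break           # limit pushdown: enough rows are already covered
        picked.append(f)
        rows += f.num_records
    return picked
```

Because the shuffle happens before pruning, two queries with the same
predicate and limit but different seeds can read different files, which is the
file-level randomness that SYSTEM sampling asks for, and no data file has to
be opened to decide.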

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
