Re: [PR] feat: TABLESAMPLE SYSTEM end-to-end + row-group / row sampling on ParquetSource [datafusion]

via GitHub Sun, 03 May 2026 17:05:27 -0700


adriangb commented on PR #22000:
URL: https://github.com/apache/datafusion/pull/22000#issuecomment-4367507046


   I've looked into the history of 
https://github.com/apache/datafusion/issues/16533 a bit.
   
   In particular the decision in 
https://github.com/apache/datafusion/pull/16505 to add hooks 
(https://github.com/apache/datafusion/pull/17843/) instead of integrating into 
core directly. The example in https://github.com/apache/datafusion/pull/17843/ 
does post-scan sampling, once the IO cost has been paid. This PR attempts to do 
much more efficient IO level sampling (files, row groups, pages). In order for 
this to work we need to add APIs in `ParquetSource`, `ExecutionPlan`, etc. At 
that point I'm not sure an "external extension" approach is worth it. It seems 
that it's simpler to just bake the feature into DataFusion. The implementation 
in this PR splits the difference: it uses the `RelationPlanner` and extension 
logical plan nodes, but bundles it with 
`SessionStateBuilder::with_default_features()`. Personally I would like to at 
least add `Sample` to `enum LogicalPlan`, I don't find the WASM argument very 
strong but maybe that's just me.
   
   @geoffreyclaude @alamb wdyt?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: TABLESAMPLE SYSTEM end-to-end + row-group / row sampling on ParquetSource [datafusion]

Reply via email to