adriangb commented on PR #22000: URL: https://github.com/apache/datafusion/pull/22000#issuecomment-4367507046
I've looked into the history of https://github.com/apache/datafusion/issues/16533 a bit. In particular the decision in https://github.com/apache/datafusion/pull/16505 to add hooks (https://github.com/apache/datafusion/pull/17843/) instead of integrating into core directly. The example in https://github.com/apache/datafusion/pull/17843/ does post-scan sampling, once the IO cost has been paid. This PR attempts to do much more efficient IO level sampling (files, row groups, pages). In order for this to work we need to add APIs in `ParquetSource`, `ExecutionPlan`, etc. At that point I'm not sure an "external extension" approach is worth it. It seems that it's simpler to just bake the feature into DataFusion. The implementation in this PR splits the difference: it uses the `RelationPlanner` and extension logical plan nodes, but bundles it with `SessionStateBuilder::with_default_features()`. Personally I would like to at least add `Sample` to `enum LogicalPlan`, I don't find the WASM argument very strong but maybe that's just me. @geoffreyclaude @alamb wdyt? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
