returnString commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543527000
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
&self,
projection: &Option<Vec<usize>>,
batch_size: usize,
+ filters: &[Expr],
) -> Result<Arc<dyn ExecutionPlan>>;
/// Returns the table Statistics
/// Statistics should be optional because not all data sources can provide
statistics.
fn statistics(&self) -> Statistics;
+
+ /// Tests whether the table provider can make use of a filter expression
+ /// to optimise data retrieval.
+ fn test_filter_pushdown(
Review comment:
> Maybe `supports_filter_pushdown`?
That's definitely better, other people seem to like it as well, sold :p
> I feel this makes us evaluate the expression twice: once for testing
filter support and once at scan time. Moreover, it is likely that different
partitions will have different `TableProviderFilterPushDown` values. It would
be great optimize out filters on that finer granularity 🚀 But that would imply
doing that optimization in the physical plan.
Yeah, I think this depends on whether it's preferable to:
- offer extensibility for providers, allowing them to interpret filters
however they like
- promote some notion of external data chunk into an actual DataFusion
concept (I need to read through your catalogue proposal doc properly!) and get
providers to return a list of available chunks with their bounds, then having
DataFusion perform the high-level filtering where possible, and pass that more
selective list of chunk IDs back for actual data retrieval
And personally I see pros and cons for both of those options (flexibility is
cool, but sometimes dangerous).
I'm not familiar enough with partitioning in DataFusion yet to quite
understand what you mean for the different pushdown support return values
across partitions; I would expect that the filter pushdown status is usually a
property of the column in question? e.g. depending on whether a column tracks
sufficient statistics and the provider storage mechanism allows for it
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]