rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544337093
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
&self,
projection: &Option<Vec<usize>>,
batch_size: usize,
+ filters: &[Expr],
) -> Result<Arc<dyn ExecutionPlan>>;
/// Returns the table Statistics
/// Statistics should be optional because not all data sources can provide
/// statistics.
fn statistics(&self) -> Statistics;
+
+ /// Tests whether the table provider can make use of a filter expression
+ /// to optimise data retrieval.
+ fn test_filter_pushdown(
Review comment:
> I'm not familiar enough with partitioning in DataFusion yet to quite
understand what you mean for the different pushdown support return values
across partitions
Sorry, I was not very clear! Let me try to do better 😄
When the logical plan is converted, DataFusion allows the target physical
plan to specify a number of partitions. This typically allows you to use that
many threads (or workers) to execute your physical plan. In the context of
datasources, this is often the number of files, as they can always be decoded
separately and in parallel. But it could be any other relevant partitioning of
your data. My point is that you will often have statistics at the
granularity of a single file, i.e. one per DataFusion partition.
So the workflow here will be the following:
- In the logical plan, you are evaluating the pushed-down expressions on the
entire datasource (all of its files) to know if some of the expressions can be
simplified out (using `supports_filter_pushdown()`). To achieve this in your
TableProvider implementation, you will need to merge the statistics of all
your files
into one that characterizes the entire datasource, then select the expressions.
- You will likely need to do the same expression pruning again at the
partition level (i.e. in the physical plan). Each file has more focused
(and thus more accurate) statistics that you will want to take advantage of,
because they will likely let you prune out more parts of the filter
expression.
My question is whether it is worth adding the complexity of trying to
simplify the filtering expression here in the logical plan when it will be
better
done at the partition level in the physical plan.
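To make the first step concrete, here is a minimal sketch of what merging per-file statistics into datasource-level statistics could look like. The `Statistics` struct is trimmed to two fields (only `total_byte_size` appears in the diff above; `num_rows` and the `merge_statistics` helper are hypothetical illustrations, not part of this PR):

```rust
/// Minimal stand-in for DataFusion's `Statistics` (illustration only).
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Hypothetical helper: merge per-file statistics into one value that
/// characterizes the entire datasource. A field stays `Some` only if it
/// is known for every file; otherwise the merged value must be `None`,
/// since a partially-known sum could lead to incorrect pruning.
fn merge_statistics(per_file: &[Statistics]) -> Statistics {
    per_file.iter().fold(
        Statistics { num_rows: Some(0), total_byte_size: Some(0) },
        |acc, s| Statistics {
            num_rows: acc.num_rows.zip(s.num_rows).map(|(a, b)| a + b),
            total_byte_size: acc
                .total_byte_size
                .zip(s.total_byte_size)
                .map(|(a, b)| a + b),
        },
    )
}

fn main() {
    let files = [
        Statistics { num_rows: Some(100), total_byte_size: Some(1024) },
        // One file with unknown byte size poisons the merged byte size.
        Statistics { num_rows: Some(50), total_byte_size: None },
    ];
    let merged = merge_statistics(&files);
    assert_eq!(merged.num_rows, Some(150));
    assert_eq!(merged.total_byte_size, None);
}
```

The same merged value could then back the datasource-wide `supports_filter_pushdown()` decision, while each partition keeps its own, more focused statistics for the physical-plan pruning.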
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
pub total_byte_size: Option<usize>,
}
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+ /// The expression cannot be used by the provider.
+ Unsupported,
+ /// The expression can be used to help minimise the data retrieved,
+ /// but the provider cannot guarantee that all returned tuples
Review comment:
I agree the names are explicit!
I am challenging whether having 3 values is necessary. `Unsupported` vs
`Inexact` is a distinction that you don't need to expose publicly here, as
those values are only used internally by the TableProvider when the
expressions are passed back to the `scan()` method (or not). Said otherwise,
you can leave it to the implementation of `scan()` to decide what to do with
the expressions; you don't need to do it with `supports_filter_pushdown()`.
Only `Exact` vs `Unsupported|Inexact` really needs to be exposed, because
that distinction is used externally (by the planner)
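To illustrate the point, here is a minimal sketch of the planner-side view (the `filter_must_stay_in_plan` helper is a hypothetical name, not part of this PR): only `Exact` changes the planner's decision, so `Unsupported` and `Inexact` collapse into a single case from the outside.

```rust
/// Mirror of the enum from the diff above, for a self-contained example.
#[derive(Debug, Clone, PartialEq)]
enum TableProviderFilterPushDown {
    Unsupported,
    Inexact,
    Exact,
}

/// Hypothetical planner-side check: the filter can be removed from the
/// logical plan only when the provider guarantees an exact evaluation.
/// `Unsupported` and `Inexact` are handled identically here, which is why
/// that distinction only matters inside the provider's `scan()`.
fn filter_must_stay_in_plan(pushdown: &TableProviderFilterPushDown) -> bool {
    !matches!(pushdown, TableProviderFilterPushDown::Exact)
}

fn main() {
    use TableProviderFilterPushDown::*;
    assert!(!filter_must_stay_in_plan(&Exact));
    assert!(filter_must_stay_in_plan(&Inexact));
    assert!(filter_must_stay_in_plan(&Unsupported));
}
```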