rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r544337093
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
&self,
projection: &Option<Vec<usize>>,
batch_size: usize,
+ filters: &[Expr],
) -> Result<Arc<dyn ExecutionPlan>>;
/// Returns the table Statistics
/// Statistics should be optional because not all data sources can provide
/// statistics.
fn statistics(&self) -> Statistics;
+
+ /// Tests whether the table provider can make use of a filter expression
+ /// to optimise data retrieval.
+ fn test_filter_pushdown(
Review comment:
> I'm not familiar enough with partitioning in DataFusion yet to quite
understand what you mean for the different pushdown support return values
across partitions
Sorry, I was not very clear! Let me try to do better 😄
When the logical plan is converted, DataFusion allows the target physical
plan to specify a number of partitions. This typically allows you to use that
many threads (or workers) to execute your physical plan. In the context of
datasources, this is often the number of files, as they can always be decoded
separately and in parallel. But it could be any other relevant partitioning of
your data. My point is that you will often have statistics at the
granularity of a single file, i.e. one per DataFusion partition.
So the workflow here will be the following:
- In the logical plan, you are evaluating the pushed-down expressions on the
entire datasource (all of its files) to know if some of the expressions can be
simplified out (using `supports_filter_pushdown()`). To achieve this in your
TableProvider implementation, you will need to merge the statistics of all
your files
into one that characterizes the entire datasource, then select the expressions.
- You will likely need to do the same expression pruning again at the
partition level (i.e. in the physical plan). Each file has more focused
(and thus more accurate) statistics that you will want to take advantage of,
because they will likely let you prune out more parts of the filter
expression.
My question is whether it is worth adding the complexity of trying to
simplify the filtering expression here in the logical plan when it will be
better
done at the partition level in the physical plan.
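To make the first step concrete, here is a minimal sketch of what merging per-file statistics into datasource-level statistics could look like. The `Statistics` struct is trimmed to two fields (only `total_byte_size` appears in the diff above; `num_rows` and the `merge_statistics` helper are hypothetical illustrations, not part of this PR):

```rust
/// Minimal stand-in for DataFusion's `Statistics` (illustration only).
#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Hypothetical helper: merge per-file statistics into one value that
/// characterizes the entire datasource. A field stays `Some` only if it
/// is known for every file; otherwise the merged value must be `None`,
/// since a partially-known sum could lead to incorrect pruning.
fn merge_statistics(per_file: &[Statistics]) -> Statistics {
    per_file.iter().fold(
        Statistics { num_rows: Some(0), total_byte_size: Some(0) },
        |acc, s| Statistics {
            num_rows: acc.num_rows.zip(s.num_rows).map(|(a, b)| a + b),
            total_byte_size: acc
                .total_byte_size
                .zip(s.total_byte_size)
                .map(|(a, b)| a + b),
        },
    )
}

fn main() {
    let files = [
        Statistics { num_rows: Some(100), total_byte_size: Some(1024) },
        // One file with unknown byte size poisons the merged byte size.
        Statistics { num_rows: Some(50), total_byte_size: None },
    ];
    let merged = merge_statistics(&files);
    assert_eq!(merged.num_rows, Some(150));
    assert_eq!(merged.total_byte_size, None);
}
```

The same merged value could then back the datasource-wide `supports_filter_pushdown()` decision, while each partition keeps its own, more focused statistics for the physical-plan pruning.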
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
pub total_byte_size: Option<usize>,
}
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+ /// The expression cannot be used by the provider.
+ Unsupported,
+ /// The expression can be used to help minimise the data retrieved,
+ /// but the provider cannot guarantee that all returned tuples
Review comment:
I agree the names are explicit!
I am challenging whether having 3 values is necessary. `Unsupported` vs
`Inexact` is a distinction that you don't need to expose publicly here, as
those values are only used internally by the TableProvider when the
expressions are passed back to the `scan()` method (or not). Said otherwise,
you can leave it to the implementation of `scan()` to decide what to do with
the expressions; you don't need to do it with `supports_filter_pushdown()`.
Only `Exact` vs `Unsupported|Inexact` really needs to be exposed, because
that distinction is used externally (by the planner)
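To illustrate the point, here is a minimal sketch of the planner-side view (the `filter_must_stay_in_plan` helper is a hypothetical name, not part of this PR): only `Exact` changes the planner's decision, so `Unsupported` and `Inexact` collapse into a single case from the outside.

```rust
/// Mirror of the enum from the diff above, for a self-contained example.
#[derive(Debug, Clone, PartialEq)]
enum TableProviderFilterPushDown {
    Unsupported,
    Inexact,
    Exact,
}

/// Hypothetical planner-side check: the filter can be removed from the
/// logical plan only when the provider guarantees an exact evaluation.
/// `Unsupported` and `Inexact` are handled identically here, which is why
/// that distinction only matters inside the provider's `scan()`.
fn filter_must_stay_in_plan(pushdown: &TableProviderFilterPushDown) -> bool {
    !matches!(pushdown, TableProviderFilterPushDown::Exact)
}

fn main() {
    use TableProviderFilterPushDown::*;
    assert!(!filter_must_stay_in_plan(&Exact));
    assert!(filter_must_stay_in_plan(&Inexact));
    assert!(filter_must_stay_in_plan(&Unsupported));
}
```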