rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543418383



##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],

Review comment:
       I would be tempted to make this async in case filtering requires some IO.

##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        _filters: &[Expr],

Review comment:
       Shouldn't we try to see how this could work with parquet and its footer 
metadata?

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
         &self,
         projection: &Option<Vec<usize>>,
         batch_size: usize,
+        filters: &[Expr],
     ) -> Result<Arc<dyn ExecutionPlan>>;
 
     /// Returns the table Statistics
     /// Statistics should be optional because not all data sources can provide 
statistics.
     fn statistics(&self) -> Statistics;
+
+    /// Tests whether the table provider can make use of a filter expression
+    /// to optimise data retrieval.
+    fn test_filter_pushdown(

Review comment:
       I feel this makes us evaluate the expression twice: once for testing 
filter support and once at scan time. Moreover, it is likely that different 
partitions will have different `TableProviderFilterPushDown` values. It would 
be great optimize out filters on that finer granularity 🚀 But that would imply 
doing that optimization in the physical plan.

##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
     pub total_byte_size: Option<usize>,
 }
 
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+    /// The expression cannot be used by the provider.
+    Unsupported,
+    /// The expression can be used to help minimise the data retrieved,
+    /// but the provider cannot guarantee that all returned tuples

Review comment:
       How will the distinction between a filter that is `Unsupported` and 
`Inexact` be leveraged? If `TableProviderFilterPushDown` is only used to know 
which filter expression can be optimized out, I think that a boolean would be 
enough.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to