rdettai commented on a change in pull request #8917:
URL: https://github.com/apache/arrow/pull/8917#discussion_r543418383
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
&self,
projection: &Option<Vec<usize>>,
batch_size: usize,
+ filters: &[Expr],
Review comment:
I would be tempted to make this async in case filtering requires some IO.
##########
File path: rust/datafusion/src/datasource/parquet.rs
##########
@@ -65,6 +66,7 @@ impl TableProvider for ParquetTable {
&self,
projection: &Option<Vec<usize>>,
batch_size: usize,
+ _filters: &[Expr],
Review comment:
Shouldn't we try to see how this could work with parquet and its footer
metadata?
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -48,9 +66,19 @@ pub trait TableProvider {
&self,
projection: &Option<Vec<usize>>,
batch_size: usize,
+ filters: &[Expr],
) -> Result<Arc<dyn ExecutionPlan>>;
/// Returns the table Statistics
/// Statistics should be optional because not all data sources can provide
statistics.
fn statistics(&self) -> Statistics;
+
+ /// Tests whether the table provider can make use of a filter expression
+ /// to optimise data retrieval.
+ fn test_filter_pushdown(
Review comment:
I feel this makes us evaluate the expression twice: once for testing
filter support and once at scan time. Moreover, it is likely that different
partitions will have different `TableProviderFilterPushDown` values. It would
be great optimize out filters on that finer granularity 🚀 But that would imply
doing that optimization in the physical plan.
##########
File path: rust/datafusion/src/datasource/datasource.rs
##########
@@ -34,6 +35,23 @@ pub struct Statistics {
pub total_byte_size: Option<usize>,
}
+/// Indicates whether and how a filter expression can be handled by a
+/// TableProvider for table scans.
+#[derive(Debug, Clone)]
+pub enum TableProviderFilterPushDown {
+ /// The expression cannot be used by the provider.
+ Unsupported,
+ /// The expression can be used to help minimise the data retrieved,
+ /// but the provider cannot guarantee that all returned tuples
Review comment:
How will the distinction between a filter that is `Unsupported` and
`Inexact` be leveraged? If `TableProviderFilterPushDown` is only used to know
which filter expression can be optimized out, I think that a boolean would be
enough.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]