tustvold commented on issue #3027:
URL: 
https://github.com/apache/arrow-datafusion/issues/3027#issuecomment-1206544769

   DataFusion has fairly mature support for the kind of predicate pruning you describe, in particular [`PruningPredicate`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html).
   
   Assuming the predicate is pushed down to the [TableScan](https://docs.rs/datafusion/latest/datafusion/logical_plan/struct.TableScan.html), the following should happen automatically:
   
   * If using [`ListingTable`](https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html) as the catalog, any non-matching [partitions](https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingOptions.html#structfield.table_partition_cols) will be filtered out
   * If using `ListingTable` with `collect_stat` enabled, the files will be pruned based on their metadata
   * [ParquetExec](https://docs.rs/datafusion/latest/datafusion/physical_plan/file_format/struct.ParquetExec.html) will prune the row groups based on the filter
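
To illustrate the idea behind the statistics-based pruning in the last two bullets, here is a minimal, std-only sketch. It is not DataFusion's API: `ColumnStats`, `Predicate`, `may_contain`, and `prune` are hypothetical names invented for this example, and it assumes a single `i64` column with known min/max values per row group.

```rust
/// Min/max statistics for one column of one row group.
/// Hypothetical type for illustration, not a DataFusion or parquet type.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
}

/// A simple comparison of the column against a literal value.
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Eq(i64),
    Lt(i64),
    Gt(i64),
}

/// Returns true if a row group with these statistics *might* contain
/// a matching row. A false result means it can be skipped safely.
fn may_contain(stats: ColumnStats, pred: Predicate) -> bool {
    match pred {
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        Predicate::Lt(v) => stats.min < v,
        Predicate::Gt(v) => stats.max > v,
    }
}

/// Returns the indices of the row groups that still have to be read.
fn prune(row_groups: &[ColumnStats], pred: Predicate) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, stats)| may_contain(**stats, pred))
        .map(|(index, _)| index)
        .collect()
}
```

Pruning of this kind is conservative: it can only prove that a row group has no matches, so the surviving row groups still need the filter applied row by row.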
   
   So, at least in theory, this should already be happening. Could you clarify:
   
   * Do you have an actual catalog, or are you using `ListingTable`? In particular, what [TableProvider](https://docs.rs/datafusion/latest/datafusion/datasource/datasource/trait.TableProvider.html) are you using?
   * Is your data partitioned at all? Is the `TableProvider` aware of this?
   * Do you have a catalog that can provide file-level metadata?
   
   > I think the solution here would be to implement a specialized version of https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html#tymethod.list where a function can be passed to determine which files get selected (based on file names).
   
   To me this sounds like a quirk of a very specific kind of data catalog. We 
could potentially add some sort of support for file-name extraction to 
`ListingTable`, but it is unclear why this would be pushed down to 
`ObjectStore`.
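
As a rough sketch of what file-name extraction at the listing layer could look like, the following std-only example parses `key=value` path segments and filters the listed paths after the fact, leaving the object store's `list` untouched. The function names and the Hive-style path layout are assumptions made for illustration; this is not how `ListingTable` is implemented.

```rust
/// Extracts `key=value` segments from a path such as
/// `table/year=2022/month=08/part-0.parquet`.
/// Note: any segment containing '=' is treated as a partition value,
/// including the file name itself.
fn partition_values(path: &str) -> Vec<(&str, &str)> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .collect()
}

/// Keeps only the paths whose partition column `col` equals `wanted`.
/// Paths that do not carry the column at all are dropped.
fn filter_by_partition<'a>(paths: &[&'a str], col: &str, wanted: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|path| {
            partition_values(path)
                .iter()
                .any(|(key, value)| *key == col && *value == wanted)
        })
        .collect()
}
```

Because the filter runs on the listing results, the storage layer stays generic and the table layer decides how names map to values.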


