tustvold commented on issue #3027: URL: https://github.com/apache/arrow-datafusion/issues/3027#issuecomment-1206544769
So DataFusion has fairly mature support for predicate pruning such as you describe, in particular https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html. Assuming the predicate is pushed down to the [TableScan](https://docs.rs/datafusion/latest/datafusion/logical_plan/struct.TableScan.html), the following should happen automatically:

* If using [`ListingTable`](https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html) as the catalog, any non-matching [partitions](https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingOptions.html#structfield.table_partition_cols) will be filtered out
* If using `ListingTable` with `collect_stat` enabled, the files will be pruned based on their metadata
* [ParquetExec](https://docs.rs/datafusion/latest/datafusion/physical_plan/file_format/struct.ParquetExec.html) will prune the row groups based on the filter

So at least theoretically, this should already be happening. Perhaps you might be able to clarify:

* Do you have an actual catalog, or are you using `ListingTable`? In particular, what [TableProvider](https://docs.rs/datafusion/latest/datafusion/datasource/datasource/trait.TableProvider.html) are you using?
* Is your data partitioned at all? Is the `TableProvider` aware of this?
* Do you have a catalog that can provide file-level metadata?

> think the solution here would be to implement a specialized version of https://docs.rs/object_store/latest/object_store/trait.ObjectStore.html#tymethod.list where a function can be passed to determine which files get selected (based on file names).

To me this sounds like a quirk of a very specific kind of data catalog. We could potentially add some sort of support for file-name extraction to `ListingTable`, but it is unclear why this would be pushed down to `ObjectStore`.
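For anyone following along, the two coarser pruning levels described above (partition directories, then file-level min/max metadata) can be pictured roughly as below. This is only a conceptual sketch in plain Rust and does not use or mirror the DataFusion API: `ColumnStats`, `Predicate`, `FileEntry`, `may_match`, `partition_value` and `prune_files` are all hypothetical names, the hive-style `key=value` path layout is an assumption, and the predicate is restricted to a single integer column.

```rust
/// Hypothetical min/max statistics for one integer column of a file
/// (the same idea applies per row group inside a Parquet file).
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: i64,
    max: i64,
}

/// Hypothetical single-column predicate of the form `column <op> literal`.
#[derive(Debug, Clone, Copy)]
enum Predicate {
    Eq(i64),
    Lt(i64),
    Gt(i64),
}

/// Hypothetical catalog entry: a file path plus optional statistics.
struct FileEntry {
    path: String,
    stats: Option<ColumnStats>,
}

/// Returns `false` only when the statistics prove that no row can match.
/// Returning `true` means "might match", so the file still has to be read.
fn may_match(stats: &ColumnStats, pred: Predicate) -> bool {
    match pred {
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        Predicate::Lt(v) => stats.min < v,
        Predicate::Gt(v) => stats.max > v,
    }
}

/// Extracts the value of a partition column from a path such as
/// `data/year=2022/part-0.parquet` (assumed hive-style layout).
fn partition_value<'a>(path: &'a str, column: &str) -> Option<&'a str> {
    path.split('/').find_map(|segment| {
        let (key, value) = segment.split_once('=')?;
        if key == column {
            Some(value)
        } else {
            None
        }
    })
}

/// Applies both pruning levels and returns the paths that survive.
/// Files with no partition value or no statistics are kept, because
/// pruning must never drop data that could match.
fn prune_files<'a>(
    files: &'a [FileEntry],
    partition_col: &str,
    partition_val: &str,
    pred: Predicate,
) -> Vec<&'a str> {
    files
        .iter()
        // Level 1: skip files whose partition value cannot match.
        .filter(|f| partition_value(&f.path, partition_col).map_or(true, |v| v == partition_val))
        // Level 2: skip files whose min/max range rules out the predicate.
        .filter(|f| f.stats.as_ref().map_or(true, |s| may_match(s, pred)))
        .map(|f| f.path.as_str())
        .collect()
}
```

The point of the sketch is that pruning is conservative: a file is only skipped when its path or statistics prove it cannot contain matching rows, and anything without usable metadata is still scanned.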
