alamb opened a new issue, #17002: URL: https://github.com/apache/datafusion/issues/17002
### Is your feature request related to a problem or challenge?

@nuno-faria implemented the core Parquet Metadata caching logic in the following PR:
- https://github.com/apache/datafusion/pull/16971

However, it doesn't seem to help certain queries that use statistics.

Specifically, I expect that the second time the query is run it should perform no network I/O at all, because the ParquetMetaData is already cached:

```sql
> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.000 seconds.

> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 4.632 seconds.

> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 2.717 seconds.
```

### Describe the solution you'd like

I would like the queries above to go faster by using the ParquetMetaData cache.

### Describe alternatives you've considered

I think this is related to the fact that there is a separate path to retrieve statistics for `ListingTable`, specifically https://github.com/apache/datafusion/blob/1452333cf0933d4d8da032af68bc5a3a05c62483/datafusion/datasource-parquet/src/file_format.rs#L975-L974

So to fix this issue, I think what we need to do is check the FileMetadataCache first, before actually fetching any ParquetMetaData.

### Additional context

_No response_
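To illustrate the "check the cache first" idea, here is a minimal sketch in Rust. It does not use DataFusion's actual `FileMetadataCache` or object store APIs: `FooterMetadata`, `CountingStore`, `MetadataCache`, and `count_rows` are all hypothetical stand-ins, and the assumption is simply that the statistics path goes through the same cache lookup as the scan path. The store counts requests so the effect of the cache on a repeated `count(*)`-style query is visible.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug)]
struct FooterMetadata {
    num_rows: u64,
}

/// Hypothetical stand-in for a remote object store that counts requests.
struct CountingStore {
    files: HashMap<String, u64>, // path -> row count
    requests: Mutex<usize>,
}

impl CountingStore {
    fn new(files: &[(&str, u64)]) -> Self {
        Self {
            files: files.iter().map(|(p, n)| (p.to_string(), *n)).collect(),
            requests: Mutex::new(0),
        }
    }

    /// Simulates one network round trip to read a file footer.
    fn fetch_footer(&self, path: &str) -> Option<FooterMetadata> {
        *self.requests.lock().unwrap() += 1;
        self.files.get(path).map(|&num_rows| FooterMetadata { num_rows })
    }

    fn requests(&self) -> usize {
        *self.requests.lock().unwrap()
    }
}

/// Hypothetical metadata cache keyed by object path.
#[derive(Default)]
struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FooterMetadata>>>,
}

impl MetadataCache {
    /// Consult the cache first; only go to the store on a miss.
    fn get_or_fetch(&self, store: &CountingStore, path: &str) -> Option<Arc<FooterMetadata>> {
        let cached = self.entries.lock().unwrap().get(path).cloned();
        if let Some(hit) = cached {
            return Some(hit);
        }
        let fetched = Arc::new(store.fetch_footer(path)?);
        self.entries
            .lock()
            .unwrap()
            .insert(path.to_string(), Arc::clone(&fetched));
        Some(fetched)
    }
}

/// Statistics-only query: sums row counts from footers via the cache.
fn count_rows(cache: &MetadataCache, store: &CountingStore, paths: &[&str]) -> u64 {
    paths
        .iter()
        .filter_map(|p| cache.get_or_fetch(store, p))
        .map(|m| m.num_rows)
        .sum()
}

fn main() {
    let store = CountingStore::new(&[("a.parquet", 10), ("b.parquet", 32)]);
    let cache = MetadataCache::default();
    let paths = ["a.parquet", "b.parquet"];

    let first = count_rows(&cache, &store, &paths);
    let requests_after_first = store.requests();
    let second = count_rows(&cache, &store, &paths);
    let requests_after_second = store.requests();

    println!("first run:  {first} rows, {requests_after_first} store requests so far");
    println!("second run: {second} rows, {requests_after_second} store requests so far");
}
```

In this sketch the second call to `count_rows` is served entirely from the cache, which is the behavior the issue asks for on the second `select count(*)`.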