alamb opened a new issue, #17002:
URL: https://github.com/apache/datafusion/issues/17002

   ### Is your feature request related to a problem or challenge?
   
   @nuno-faria implemented the core Parquet Metadata caching logic in the 
following PR:
   - https://github.com/apache/datafusion/pull/16971
   
   However, it doesn't seem to help certain queries that use statistics. 
Specifically, I expect that the second time the query is run it should do no 
network I/O at all, because the ParquetMetaData is already cached:
   
   ```sql
   > set datafusion.execution.parquet.cache_metadata = true;
   0 row(s) fetched.
   Elapsed 0.000 seconds.
   
   > select count(*) from 
's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
   +----------+
   | count(*) |
   +----------+
   | 99997497 |
   +----------+
   1 row(s) fetched.
   Elapsed 4.632 seconds.
   
   > select count(*) from 
's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
   +----------+
   | count(*) |
   +----------+
   | 99997497 |
   +----------+
   1 row(s) fetched.
   Elapsed 2.717 seconds.
   ```
   
   ### Describe the solution you'd like
   
   I would like the queries above to go faster by using the ParquetMetaData 
cache.
   
   ### Describe alternatives you've considered
   
   I think this is related to the fact that there is a separate path to 
retrieve statistics for `ListingTable`, specifically 
https://github.com/apache/datafusion/blob/1452333cf0933d4d8da032af68bc5a3a05c62483/datafusion/datasource-parquet/src/file_format.rs#L975-L974
   
   So to fix this issue, I think we need to check the `FileMetadataCache` 
first, before actually fetching any ParquetMetaData.
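   
   To make the idea concrete, here is a minimal, std-only sketch of the 
   "check the cache first" pattern. It is only an illustration: `FileMetadata`, 
   `MetadataCache`, `get_or_fetch` and `row_count` are hypothetical names 
   invented for this example and are not DataFusion's actual 
   `FileMetadataCache` API. The closure stands in for the network read of the 
   Parquet footer.
   
   ```rust
   use std::collections::HashMap;
   use std::sync::{Arc, Mutex};
   
   /// Hypothetical stand-in for decoded Parquet footer metadata.
   #[derive(Debug)]
   struct FileMetadata {
       num_rows: u64,
   }
   
   /// Hypothetical cache keyed by object path (NOT DataFusion's real API).
   #[derive(Default)]
   struct MetadataCache {
       entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
   }
   
   impl MetadataCache {
       /// Return cached metadata for `path`, calling `fetch` only on a miss.
       /// For simplicity the lock is held while `fetch` runs; a real
       /// implementation would likely avoid that.
       fn get_or_fetch<F>(&self, path: &str, fetch: F) -> Arc<FileMetadata>
       where
           F: FnOnce() -> FileMetadata,
       {
           let mut entries = self.entries.lock().unwrap();
           if let Some(hit) = entries.get(path) {
               return Arc::clone(hit);
           }
           let fetched = Arc::new(fetch());
           entries.insert(path.to_string(), Arc::clone(&fetched));
           fetched
       }
   }
   
   /// Simulated statistics path: consult the cache before "fetching".
   /// `fetches` counts how many simulated network reads happened.
   fn row_count(cache: &MetadataCache, path: &str, fetches: &mut u32) -> u64 {
       cache
           .get_or_fetch(path, || {
               *fetches += 1; // stands in for reading the footer over the network
               FileMetadata { num_rows: 42 }
           })
           .num_rows
   }
   
   fn main() {
       let cache = MetadataCache::default();
       let mut fetches = 0;
       let first = row_count(&cache, "hits_0.parquet", &mut fetches);
       let second = row_count(&cache, "hits_0.parquet", &mut fetches);
       assert_eq!(first, second);
       assert_eq!(fetches, 1); // the second call is served from the cache
       println!("rows = {first}, fetches = {fetches}");
   }
   ```
   
   The point of the sketch is that the statistics path and the scan path 
   would go through the same lookup, so a second `count(*)` over the same 
   files would not need to re-read any footers.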
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

