nuno-faria commented on code in PR #17022:
URL: https://github.com/apache/datafusion/pull/17022#discussion_r2262530221
##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -968,19 +973,41 @@ pub async fn fetch_parquet_metadata(
meta: &ObjectMeta,
size_hint: Option<usize>,
#[allow(unused)] decryption_properties: Option<&FileDecryptionProperties>,
-) -> Result<ParquetMetaData> {
+ file_metadata_cache: Option<Arc<dyn FileMetadataCache>>,
+) -> Result<Arc<ParquetMetaData>> {
+ if let Some(cache) = &file_metadata_cache {
+ if let Some(cached_metadata) = cache.get(meta) {
+ if let Some(parquet_metadata) = cached_metadata
+ .as_any()
+ .downcast_ref::<CachedParquetMetaData>()
+ {
+ return Ok(Arc::clone(parquet_metadata.parquet_metadata()));
+ }
+ }
+ }
+
let file_size = meta.size;
let fetch = ObjectStoreFetch::new(store, meta);
let reader = ParquetMetaDataReader::new().with_prefetch_hint(size_hint);
-
#[cfg(feature = "parquet_encryption")]
let reader = reader.with_decryption_properties(decryption_properties);
- reader
- .load_and_finish(fetch, file_size)
- .await
- .map_err(DataFusionError::from)
+ let metadata = Arc::new(
+ reader
+ .load_and_finish(fetch, file_size)
+ .await
+ .map_err(DataFusionError::from)?,
+ );
+
+ if let Some(cache) = file_metadata_cache {
+ cache.put(
+ meta,
+ Arc::new(CachedParquetMetaData::new(Arc::clone(&metadata))),
+ );
+ }
Review Comment:
I think there is an issue with the `fetch_parquet_metadata` function. When
this function is initially called to retrieve the schema (in `fetch_schema`),
it will read the metadata and update the cache. When the
`CachedParquetFileReader` tries to get the metadata, it checks that it is
present in the cache. However, the cached metadata does not contain the page
index, as it is not retrieved in the `fetch_parquet_metadata`, meaning it will
have to be read in every query.
So this `fetch_parquet_metadata` needs to retrieve the entire metadata for
the caching to be effective. On the other hand, after this we will have two
different places where the entire metadata is read and cached
(`CachedParquetFileReader` and `fetch_parquet_metadata`), so creating an
utility function `retrieve_full_parquet_metadata -> Arc<ParquetMetaData>` might
be useful to avoid duplicate modifications in the future.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]