alamb commented on code in PR #17127: URL: https://github.com/apache/datafusion/pull/17127#discussion_r2267708500
########## datafusion/datasource-parquet/src/file_format.rs: ########## @@ -1038,98 +1015,32 @@ impl MetadataFetch for ObjectStoreFetch<'_> { /// through [`ParquetFileReaderFactory`]. /// /// [`ParquetFileReaderFactory`]: crate::ParquetFileReaderFactory -pub async fn fetch_parquet_metadata<F: MetadataFetch>( - fetch: F, +#[deprecated( Review Comment: I left all the existing public APIs and deprecated them, and updated them to call the new DFParquetMetadata structure ########## datafusion/datasource-parquet/src/file_format.rs: ########## @@ -1935,40 +1688,9 @@ async fn output_single_parquet_file_parallelized( Ok(file_metadata) } -/// Min/max aggregation can take Dictionary encode input but always produces unpacked Review Comment: I am quite please that most of the statistics handling is now consolidated into its own module ########## datafusion/datasource-parquet/src/reader.rs: ########## @@ -256,44 +258,19 @@ impl AsyncFileReader for CachedParquetFileReader { #[cfg(not(feature = "parquet_encryption"))] let file_decryption_properties = None; - fetch_parquet_metadata( - &mut self.inner, - &file_meta.object_meta, - None, - file_decryption_properties, - Some(metadata_cache), - ) - .await - .map_err(|e| { - parquet::errors::ParquetError::General(format!( - "Failed to fetch metadata for file {}: {e}", - file_meta.object_meta.location, - )) - }) + // TODO should there be metadata prefetch hint here? Review Comment: The metadata prefetch hint isn't passed here (it isn't on `main` either) but this refactor leads me to believe it might be helpful to do so 🤔 ########## datafusion/core/src/datasource/file_format/parquet.rs: ########## @@ -195,31 +194,24 @@ mod tests { let format = ParquetFormat::default().with_force_view_types(force_views); let schema = format.infer_schema(&ctx, &store, &meta).await?; - let stats = fetch_statistics( - store.as_ref(), - schema.clone(), - &meta[0], - None, - None, - Some(ctx.runtime_env().cache_manager.get_file_metadata_cache()), - ) - .await?; + let file_metadata_cache = Review Comment: this shows the key API difference -- instead of calling a bunch of free functions, you now construct a `DFParquetMetadata` and call methods on that struct instead ########## datafusion/core/src/datasource/file_format/parquet.rs: ########## @@ -392,51 +384,27 @@ mod tests { // Use a size hint larger than the parquet footer but smaller than the actual metadata, requiring a second fetch // for the remaining metadata - fetch_parquet_metadata( - ObjectStoreFetch::new(store.as_ref() as &dyn ObjectStore, &meta[0]), - &meta[0], - Some(9), - None, - None, - ) - .await - .expect("error reading metadata with hint"); + let file_metadata_cache = + ctx.runtime_env().cache_manager.get_file_metadata_cache(); + let df_meta = DFParquetMetadata::new(store.as_ref(), &meta[0]) + .with_metadata_size_hint(Some(9)); + df_meta.fetch_metadata().await?; assert_eq!(store.request_count(), 2); + let df_meta = + df_meta.with_file_metadata_cache(Some(Arc::clone(&file_metadata_cache))); + // Increases by 3 because cache has no entries yet - fetch_parquet_metadata( Review Comment: I think the new struct makes it much clearer what is being tested vs what is test setup functionality and I find the updated tests to be much easier to read ########## datafusion/datasource-parquet/src/file_format.rs: ########## @@ -306,30 +301,6 @@ fn clear_metadata( }) } -async fn fetch_schema_with_location( Review Comment: Most of this PR is moving code in this module into `metadata.rs` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org