kylebarron opened a new issue, #5582:
URL: https://github.com/apache/arrow-rs/issues/5582

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   In some multi-file Parquet dataset layouts, there is a sidecar metadata 
file, canonically named `_metadata`, which holds only the metadata for each row 
group in the dataset. See 
https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files:
   
   > Some processing frameworks such as Spark or Dask (optionally) use 
`_metadata` and `_common_metadata` files with partitioned datasets.
   
   > Those files include information about the schema of the full dataset (for 
`_common_metadata`) and potentially all row group metadata of all files in the 
partitioned dataset as well (for `_metadata`). The actual files are 
metadata-only Parquet files. Note this is not a Parquet standard, but a 
convention set in practice by those frameworks.
   
   > Using those files can give a more efficient creation of a parquet Dataset, 
since it can use the stored schema and file paths of all row groups, instead of 
inferring the schema and crawling the directories for all Parquet files (this 
is especially the case for filesystems where accessing files is expensive).
   
   I'd like to be able to use such metadata files to accelerate reading of 
Parquet datasets in [geoarrow-rs](https://github.com/geoarrow/geoarrow-rs). 
Mimicking pyarrow's API, I currently have a [`ParquetFile` 
struct](https://github.com/geoarrow/geoarrow-rs/blob/8a9385eeeebe434ab49efbae830666e3a3997f6a/src/io/parquet/reader/async.rs#L69-L74),
 which is backed by a single `R: AsyncFileReader`, as well as a 
[`ParquetDataset` 
struct](https://github.com/geoarrow/geoarrow-rs/blob/8a9385eeeebe434ab49efbae830666e3a3997f6a/src/io/parquet/reader/async.rs#L263-L267),
 which is backed by `Vec<ParquetFile<R>>` where `R: AsyncFileReader`. This 
allows concurrent async reads across multiple files.
   
   I'd like to add a `ParquetDataset::from_metadata` constructor that builds 
the dataset from a `_metadata` file. To do that, I need to construct an 
`ArrowReaderMetadata` for each underlying file. This is entirely possible with 
existing APIs, except that `ArrowReaderMetadata::try_new` has `pub(crate)` 
visibility.
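
   To make the intended flow concrete, here is a minimal sketch of the grouping step that a `from_metadata` constructor would need: take the row group entries listed in `_metadata` and split them by the data file they belong to. `RowGroupEntry` and `group_by_file` are hypothetical stand-ins written for illustration, not types or functions from the `parquet` crate, and the sketch assumes each row group entry records the path of its data file. In real code, each per-file group would then be turned into per-file metadata and passed to `ArrowReaderMetadata::try_new`, which is the call that is currently not public.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one row group entry read from a `_metadata` file.
/// This is NOT a type from the `parquet` crate.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupEntry {
    /// Path of the data file holding this row group (assumed to be recorded).
    file_path: String,
    /// Number of rows in the row group.
    num_rows: i64,
}

/// Groups row group entries by data file, preserving the order in which
/// row groups appear within each file.
fn group_by_file(entries: &[RowGroupEntry]) -> BTreeMap<String, Vec<RowGroupEntry>> {
    let mut per_file: BTreeMap<String, Vec<RowGroupEntry>> = BTreeMap::new();
    for entry in entries {
        per_file
            .entry(entry.file_path.clone())
            .or_default()
            .push(entry.clone());
    }
    per_file
}

fn main() {
    // Illustrative entries; the file names and row counts are made up.
    let entries = vec![
        RowGroupEntry { file_path: "part-0.parquet".into(), num_rows: 100 },
        RowGroupEntry { file_path: "part-1.parquet".into(), num_rows: 50 },
        RowGroupEntry { file_path: "part-0.parquet".into(), num_rows: 25 },
    ];

    for (path, row_groups) in group_by_file(&entries) {
        let total_rows: i64 = row_groups.iter().map(|rg| rg.num_rows).sum();
        println!("{path}: {} row groups, {total_rows} rows", row_groups.len());
    }
}
```

   With per-file groups like these, each `ParquetFile` in the dataset could be constructed without first fetching that file's footer, which is the point of reading `_metadata` up front.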
   
   
   **Describe the solution you'd like**
   
   Give `ArrowReaderMetadata::try_new` full public visibility.
   
   **Describe alternatives you've considered**
   
   Unsure of alternatives.
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
