[I] Create `ArrowReaderMetadata` from externalized metadata [arrow-rs]

via GitHub Tue, 02 Apr 2024 14:25:01 -0700


kylebarron opened a new issue, #5582:
URL: https://github.com/apache/arrow-rs/issues/5582

**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**

In some multi-file Parquet dataset layouts, there is a sidecar metadata
file, canonically named `_metadata`, which holds only the metadata for each row
group in the dataset. See
https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files:

> Some processing frameworks such as Spark or Dask (optionally) use
`_metadata` and `_common_metadata` files with partitioned datasets.

> Those files include information about the schema of the full dataset (for
`_common_metadata`) and potentially all row group metadata of all files in the
partitioned dataset as well (for `_metadata`). The actual files are
metadata-only Parquet files. Note this is not a Parquet standard, but a
convention set in practice by those frameworks.

> Using those files can give a more efficient creation of a parquet Dataset,
since it can use the stored schema and file paths of all row groups, instead of
inferring the schema and crawling the directories for all Parquet files (this
is especially the case for filesystems where accessing files is expensive).

I'd like to be able to use such metadata files to accelerate reading of
Parquet datasets in [geoarrow-rs](https://github.com/geoarrow/geoarrow-rs).
Mimicking pyarrow's API, I currently have a [`ParquetFile`
struct](https://github.com/geoarrow/geoarrow-rs/blob/8a9385eeeebe434ab49efbae830666e3a3997f6a/src/io/parquet/reader/async.rs#L69-L74),
which is backed by a single `R: AsyncFileReader`, as well as a
[`ParquetDataset`
struct](https://github.com/geoarrow/geoarrow-rs/blob/8a9385eeeebe434ab49efbae830666e3a3997f6a/src/io/parquet/reader/async.rs#L263-L267),
which is backed by a `Vec<ParquetFile<R>>` where `R: AsyncFileReader`. This
allows concurrent async reads across multiple files.

I'd like to have a `ParquetDataset::from_metadata` method, which constructs
itself from a `_metadata` file. But to do that I need to be able to construct
`ArrowReaderMetadata` for each underlying file. This is entirely possible with
existing APIs, except that `ArrowReaderMetadata::try_new` has visibility
`pub(crate)`.
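
As a rough illustration of the step a `from_metadata` constructor would need, here is a minimal sketch of splitting the row group entries of a `_metadata` file into per-file groups, from which per-file reader metadata could then be built. It uses only the standard library. `RowGroupEntry`, `FileMeta`, and `split_by_file` are hypothetical stand-ins invented for this example, not types or functions from the `parquet` crate, and it assumes each row group entry records the path of the data file it belongs to.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for one row group entry in a
/// `_metadata` file. This is NOT the parquet crate's row group type.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupEntry {
    /// Path of the data file that holds this row group (assumed to be
    /// recorded alongside the row group metadata).
    file_path: String,
    /// Number of rows in this row group.
    num_rows: u64,
}

/// Hypothetical per-file metadata: the row groups belonging to one file.
#[derive(Debug, Default, PartialEq)]
struct FileMeta {
    row_groups: Vec<RowGroupEntry>,
}

impl FileMeta {
    /// Total number of rows across all row groups of this file.
    fn num_rows(&self) -> u64 {
        self.row_groups.iter().map(|rg| rg.num_rows).sum()
    }
}

/// Group the dataset-wide row group entries by data file, preserving the
/// order of row groups within each file.
fn split_by_file(entries: Vec<RowGroupEntry>) -> BTreeMap<String, FileMeta> {
    let mut files: BTreeMap<String, FileMeta> = BTreeMap::new();
    for entry in entries {
        files
            .entry(entry.file_path.clone())
            .or_default()
            .row_groups
            .push(entry);
    }
    files
}
```

In a real implementation, each per-file group would be turned into that file's Parquet metadata and then passed to `ArrowReaderMetadata::try_new`, which is the call that is currently blocked by `pub(crate)` visibility.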

**Describe the solution you'd like**

Give `ArrowReaderMetadata::try_new` full public visibility.

**Describe alternatives you've considered**

Unsure of alternatives.

**Additional context**

