mrbrahman commented on issue #40958:
URL: https://github.com/apache/arrow/issues/40958#issuecomment-2050501230
@wgtmac, no, I don't think a _metadata file would be widely used in big-data
systems such as Hadoop or Spark. However, Apache Arrow does seem to have
the required API (in ParquetFile) to read metadata separately from the data.
Of course, I'm not sure whether Apache Arrow will specifically support
the section below from my request (because we currently have no way to stitch
two metadata files together):
> Once this is done, a combined dataset can be created using:
> ~~~python
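> import pyarrow.parquet as pq
>
> m = pq.read_metadata('_metadata')
> # proposed usage: ParquetFile does not currently accept multiple files
> data = pq.ParquetFile('file1.parquet', 'file2.parquet', metadata=m)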
>
> # data should now be able to show all columns
> ~~~
where file1.parquet contains col1, col2, and col3, and file2.parquet contains
col4 and col5 (a different set of columns). Here, only the _metadata file has
the overarching information about the 'table' definition.
I'm only guessing that this would be supported, since the API to pass metadata
separately exists. However, it would be nice to have that confirmed as well.