pitrou commented on issue #45287:
URL: https://github.com/apache/arrow/issues/45287#issuecomment-2607856045
> In that dataset I observed that the length of the metadata region in one of the .parquet files is 1082066 bytes, and since the metadata region is read in full, the reader needs to read ~120 bytes of metadata-region-data per data value -- so I think it would be expected if there's some memory usage overhead because of this
Yes, unfortunately with the current version of the Parquet format it's
difficult to avoid that overhead.
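
To illustrate the scale of that overhead, here is a minimal sketch (assuming the Python bindings and a hypothetical file name `wide.parquet`) that opens only the footer and reports roughly how many metadata bytes it carries per data value:

```python
import pyarrow.parquet as pq

# Opening the file parses the Thrift-serialized footer, but no data pages.
pf = pq.ParquetFile("wide.parquet")  # hypothetical path
meta = pf.metadata

print("serialized footer size (bytes):", meta.serialized_size)
print("columns:", meta.num_columns)
print("row groups:", meta.num_row_groups)
print("rows:", meta.num_rows)

# Rough "metadata bytes per data value", comparable to the ~120 bytes/value
# figure quoted above.
num_values = meta.num_rows * meta.num_columns
if num_values:
    print("footer bytes per value:", meta.serialized_size / num_values)
```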
There are discussions in the Parquet community about redesigning the Parquet metadata precisely to avoid this metadata-loading overhead with very wide schemas. A preliminary proof of concept gave encouraging results, but the project still needs to be pushed forward with an actual spec and implementations.
> However I think what our main concern is is that the memory usage doesn't seem to be constant -- it constantly increases and isn't freed after the read is done
When you say it isn't freed, what does your use case look like exactly? Do you:
* always reuse the same dataset to read different rows and/or columns?
* dispose of the dataset and create a new one for each read?
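
For reference, here is a minimal sketch (assuming the Python bindings; `data_dir` and the column names are placeholders) of the two patterns, querying the Arrow memory pool after each read to see whether allocated memory keeps growing:

```python
import pyarrow as pa
import pyarrow.dataset as ds

data_dir = "path/to/parquet/dir"  # placeholder

# Pattern 1: reuse the same dataset object across reads.
dataset = ds.dataset(data_dir, format="parquet")
for cols in (["col_a"], ["col_b"]):
    table = dataset.to_table(columns=cols)
    del table
    print("reused dataset, allocated bytes:", pa.total_allocated_bytes())

# Pattern 2: dispose of the dataset and create a new one for each read.
for cols in (["col_a"], ["col_b"]):
    fresh = ds.dataset(data_dir, format="parquet")
    table = fresh.to_table(columns=cols)
    del table, fresh
    print("fresh dataset, allocated bytes:", pa.total_allocated_bytes())
```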