pitrou commented on issue #45287:
URL: https://github.com/apache/arrow/issues/45287#issuecomment-2607856045
> In that dataset I observed that the length of the metadata region in one of the .parquet files is 1082066 bytes, and since the metadata region is read in full, the reader needs to read ~120 bytes of metadata-region-data per data value -- so I think it would be expected if there's some memory usage overhead because of this
Yes, unfortunately with the current version of the Parquet format it's
difficult to avoid that overhead.
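
To illustrate the scale of that overhead, here is a minimal sketch (assuming the Python bindings and a hypothetical file name `wide.parquet`) that opens only the footer and reports roughly how many metadata bytes it carries per data value:

```python
import pyarrow.parquet as pq

# Opening the file parses the Thrift-serialized footer, but no data pages.
pf = pq.ParquetFile("wide.parquet")  # hypothetical path
meta = pf.metadata

print("serialized footer size (bytes):", meta.serialized_size)
print("columns:", meta.num_columns)
print("row groups:", meta.num_row_groups)
print("rows:", meta.num_rows)

# Rough "metadata bytes per data value", comparable to the ~120 bytes/value
# figure quoted above.
num_values = meta.num_rows * meta.num_columns
if num_values:
    print("footer bytes per value:", meta.serialized_size / num_values)
```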
There are discussions in the Parquet community about redesigning the Parquet metadata precisely to avoid this metadata-loading overhead with very wide schemas. A preliminary proof of concept gave encouraging results, but the project still needs to be pushed forward with an actual spec and implementations.
> However I think what our main concern is is that the memory usage doesn't seem to be constant -- it constantly increases and isn't freed after the read is done
When you say it isn't freed, what does your use case look like exactly? Do you:
* always reuse the same dataset to read different rows and/or columns?
* dispose of the dataset and create a new one for each read?
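
For reference, here is a minimal sketch (assuming the Python bindings; `data_dir` and the column names are placeholders) of the two patterns, querying the Arrow memory pool after each read to see whether allocated memory keeps growing:

```python
import pyarrow as pa
import pyarrow.dataset as ds

data_dir = "path/to/parquet/dir"  # placeholder

# Pattern 1: reuse the same dataset object across reads.
dataset = ds.dataset(data_dir, format="parquet")
for cols in (["col_a"], ["col_b"]):
    table = dataset.to_table(columns=cols)
    del table
    print("reused dataset, allocated bytes:", pa.total_allocated_bytes())

# Pattern 2: dispose of the dataset and create a new one for each read.
for cols in (["col_a"], ["col_b"]):
    fresh = ds.dataset(data_dir, format="parquet")
    table = fresh.to_table(columns=cols)
    del table, fresh
    print("fresh dataset, allocated bytes:", pa.total_allocated_bytes())
```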