mapleFU commented on issue #38245: URL: https://github.com/apache/arrow/issues/38245#issuecomment-1766879637
> Now, even for the dictionary-encoded file of 126MB, it's still strange that this needs a peak memory usage of 1 GB (which I can reproduce, also measuring with memray). If we naively count the different parts that are needed, you get: loading the bytes of the file itself (126 MB), uncompressing it (201 MB), deserializing/decoding it, and creating the actual Arrow memory (170 MB).

Considering (1) in https://github.com/apache/arrow/issues/38245#issuecomment-1766857566: we still need a `Buffer` for each column, see https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1781. Also, the row count is 5000, so for INT64 each column would need 40000 bytes (5000 rows × 8 bytes). Besides, the `ResizableBuffer` might allocate more memory than actually required, which makes memory usage grow larger.
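The arithmetic and the over-allocation point can be illustrated with a small Python sketch. Note that `rounded_capacity` and `doubling_capacity` are hypothetical helpers modeling common allocator growth strategies, not Arrow's actual `ResizableBuffer` implementation, whose policy may differ:

```python
# Toy model of resizable-buffer growth (illustrative only; NOT Arrow's
# actual allocator -- the real ResizableBuffer policy may differ).

def rounded_capacity(requested: int, alignment: int = 64) -> int:
    """Round the requested size up to a multiple of `alignment` bytes,
    mimicking allocators that pad for alignment."""
    return -(-requested // alignment) * alignment

def doubling_capacity(requested: int, initial: int = 1024) -> int:
    """Grow capacity by doubling until it covers the request,
    a common strategy for resizable buffers."""
    cap = initial
    while cap < requested:
        cap *= 2
    return cap

# One INT64 column with 5000 rows needs 5000 * 8 = 40000 bytes of values.
rows, width = 5000, 8
needed = rows * width
print(needed)                    # 40000 bytes strictly required
print(rounded_capacity(needed))  # 40000 (already a multiple of 64)
print(doubling_capacity(needed)) # 65536: a doubling policy over-allocates ~64%
```

Under a doubling policy, each such column buffer could carry roughly 25 KB of slack on top of the 40 KB of data, which compounds across many columns and row groups.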
