mapleFU commented on issue #38245: URL: https://github.com/apache/arrow/issues/38245#issuecomment-1766879637
> Now, even for the dictionary-encoded file of 126MB, it's still strange that this needs a peak memory usage of 1 GB (which I can reproduce, also measuring with memray). If we naively count the different parts that are needed, you get: loading the bytes of the file itself (126 MB), uncompressing it (201 MB), deserializing/decoding it, and creating the actual Arrow memory (170 MB).

Considering (1) in https://github.com/apache/arrow/issues/38245#issuecomment-1766857566: we still need a `Buffer` for each column, see https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc#L1781. Also, the row count is 5000, so for INT64 each column would need 40000 bytes (5000 rows × 8 bytes). Besides, the `ResizableBuffer` might allocate more memory than actually required, which makes memory usage grow larger.
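The arithmetic and the over-allocation point can be illustrated with a small Python sketch. Note that `rounded_capacity` and `doubling_capacity` are hypothetical helpers modeling common allocator growth strategies, not Arrow's actual `ResizableBuffer` implementation, whose policy may differ:

```python
# Toy model of resizable-buffer growth (illustrative only; NOT Arrow's
# actual allocator -- the real ResizableBuffer policy may differ).

def rounded_capacity(requested: int, alignment: int = 64) -> int:
    """Round the requested size up to a multiple of `alignment` bytes,
    mimicking allocators that pad for alignment."""
    return -(-requested // alignment) * alignment

def doubling_capacity(requested: int, initial: int = 1024) -> int:
    """Grow capacity by doubling until it covers the request,
    a common strategy for resizable buffers."""
    cap = initial
    while cap < requested:
        cap *= 2
    return cap

# One INT64 column with 5000 rows needs 5000 * 8 = 40000 bytes of values.
rows, width = 5000, 8
needed = rows * width
print(needed)                    # 40000 bytes strictly required
print(rounded_capacity(needed))  # 40000 (already a multiple of 64)
print(doubling_capacity(needed)) # 65536: a doubling policy over-allocates ~64%
```

Under a doubling policy, each such column buffer could carry roughly 25 KB of slack on top of the 40 KB of data, which compounds across many columns and row groups.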
