westonpace commented on issue #11043: URL: https://github.com/apache/arrow/issues/11043#issuecomment-912034302
The short, though maybe unsatisfying, answer is that it is row group metadata and it is expected. The parquet file writer keeps some data in memory for each row group it creates; this is used to populate the footer of the parquet file. To verify this, simply rerun your experiment with 1 or 10 rows per row group and you will see the exact same memory usage pattern.

One thing to note is that you are creating 400 columns even though you are only using 100 of them. This makes the row group metadata about 4x larger than it needs to be.

There is probably room for improvement in what information we really need to keep; I don't know that any serious optimization passes have been done on this front. The usual advice is to use large row groups, splitting the data across multiple files if needed, so that no single file ever accumulates too much row group metadata.

> The used memory seems to roughly double every time it grows. This indicates that the memory is allocated by a std::vector or a container type which has the same exponential growth behavior.

This is probably just an artifact of the way you are measuring memory usage. The VSZ/RSS numbers tend to grow in chunks because malloc tries to reduce the number of allocations it needs to make.