westonpace commented on issue #11043: URL: https://github.com/apache/arrow/issues/11043#issuecomment-912034302
The short, though maybe unsatisfying, answer is that it is row group metadata and it is expected. The parquet file writer keeps some data in memory for each row group it creates; this is used to populate the footer of the parquet file. To verify this, simply rerun your experiment with 1 or 10 rows per row group and you will see the exact same memory usage pattern.

One thing to note is that you are creating 400 columns even though you are only using 100 of them. This makes the row group metadata about 4x larger than it needs to be.

There is probably room for improvement in what information we really need to keep; I don't know that any serious optimization passes have been done on this front. The usual advice is to use large row groups, splitting the data across multiple files if needed, so that no single file ever accumulates too much row group metadata.

> The used memory seems to roughly double every time it grows. This indicates that the memory is allocated by a std::vector or a container type which has the same exponential growth behavior.

This is probably just an artifact of the way you are measuring memory usage. The VSZ/RSS numbers tend to grow in chunks because malloc tries to reduce the number of allocations it needs to make.