timothydijamco commented on issue #45287:
URL: https://github.com/apache/arrow/issues/45287#issuecomment-2607664288

   Thanks for helping look into this
   
   > > Yeah this is an extreme case just to show the repro. In practice there are a couple thousand rows per file.
   > 
   > How many row groups per file (or rows per row group)? It turns out much of the Parquet metadata consumption is in ColumnChunk entries. A Thrift-deserialized ColumnChunk is 640 bytes long, and there are O(CRF) ColumnChunks in your dataset, with C=number_columns, R=number_row_groups_per_file and F=number_files.
   
   We typically use one row group per file
   
   For some additional background, one of the situations where we originally 
observed high memory usage is this:
   * Dataset has ~3000 rows per row group (and per file) and 5000 columns
   * User is reading 3 columns
   
   In that dataset I observed that the metadata region in one of the .parquet files is 1,082,066 bytes long. Since the metadata region is read in full, the reader ends up reading roughly 120 bytes of metadata per data value it actually returns (1,082,066 bytes / (3,000 rows * 3 columns) ≈ 120), so some memory overhead is expected. However, our main concern is that the memory usage doesn't seem to be constant: it keeps increasing and isn't freed after the read is done.
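
   To put rough numbers on both effects, here's a back-of-the-envelope sketch (just arithmetic on the 640-bytes-per-ColumnChunk figure quoted above and the dataset shape from the bullets; the actual sizes will of course vary per file):

   ```python
   # Rough per-file numbers for the dataset described above.
   columns = 5000              # columns in the dataset
   row_groups_per_file = 1     # we use one row group per file
   rows_per_file = 3000        # rows per row group (and per file)
   columns_read = 3            # columns actually selected by the reader
   metadata_len = 1_082_066    # observed metadata-region length, in bytes

   # Thrift-deserialized ColumnChunk metadata per file, at ~640 bytes each.
   column_chunk_bytes = columns * row_groups_per_file * 640   # 3,200,000 (~3.2 MB)

   # Metadata read per data value actually returned.
   metadata_per_value = metadata_len / (rows_per_file * columns_read)   # ~120 bytes
   ```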
   
   
   > Hmm, this needs clarifying a bit then :) What do the memory usage numbers you posted represent? Is it peak memory usage? Is it memory usage after loading the dataset as an Arrow table? Is the dataset object still alive at that point?
   
   I think it's peak memory usage measured after loading the data into an Arrow Table. However, I'm not sure whether the dataset object is still alive at that point. I'll work on a C++ repro and share it here.
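
   In the meantime, here's roughly the measurement I have in mind, sketched in Python against pyarrow's dataset API (the actual repro will be in C++). It uses `pyarrow.total_allocated_bytes()` to check whether Arrow buffer memory comes back down once the table and dataset objects are released; the path and column names below are placeholders:

   ```python
   import gc

   import pyarrow as pa
   import pyarrow.dataset as ds

   DATASET_PATH = "/path/to/dataset"  # placeholder: directory of wide .parquet files

   print("before:", pa.total_allocated_bytes())

   dataset = ds.dataset(DATASET_PATH, format="parquet")
   # Read a handful of the ~5000 columns (placeholder column names).
   table = dataset.to_table(columns=["col_a", "col_b", "col_c"])

   print("dataset + table alive:", pa.total_allocated_bytes())

   # Drop both objects and see whether the allocation is actually released.
   del table
   del dataset
   gc.collect()

   print("after releasing table and dataset:", pa.total_allocated_bytes())
   ```

   One caveat: I believe `total_allocated_bytes()` only tracks Arrow's default memory pool, so allocations made outside the pool (e.g. the Thrift-deserialized metadata structures) won't show up there; the C++ repro can also watch RSS to cover that.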
   

