jorisvandenbossche commented on issue #38245:
URL: https://github.com/apache/arrow/issues/38245#issuecomment-1766848931
I have to admit that it indeed looks like a lot of memory for what is being
read.
First, what is a bit confusing about this case is that it is a quite
sub-optimal Parquet file: it's 126 MB on disk, while being 170 MB in memory
when loaded into Arrow. In many cases the gap is much bigger, with the file on
disk considerably smaller than the in-memory table. In this case, though, 1) you
have many columns and not many rows (which gives some extra metadata overhead),
and 2) you have columns of consecutive integers, which get dictionary encoded,
but since the integers are all unique, this encoding doesn't actually help at
all (typically, with this kind of data and more rows, we stop dictionary
encoding at some point and fall back to plain encoding, but here the dictionary
is kept for every column, because you have a lot of columns with relatively few
rows).
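For reference, a rough sketch of how a file like this could be constructed (the row/column counts here are made up for illustration, not taken from the original reproducer):
```python
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical shape: many columns, relatively few rows, all-unique integers
n_rows, n_cols = 20_000, 1_000
table = pa.table(
    {f"col{i}": np.arange(i * n_rows, (i + 1) * n_rows) for i in range(n_cols)}
)

# Dictionary encoding is on by default, but with all-unique values the
# dictionary is essentially as large as the data itself
pq.write_table(table, "myfile.parquet")

print(os.path.getsize("myfile.parquet") / 1024**2)  # size on disk in MB
print(table.nbytes / 1024**2)                       # size in memory in MB
```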
If I disable dictionary encoding, the file size reduces to 94 MB
(`use_dictionary=False` in `pq.write_table`), and when additionally enabling
DELTA_BINARY_PACKED as @mapleFU mentioned above, it reduces to a small file of
3 MB (`use_dictionary=False, column_encoding="DELTA_BINARY_PACKED"`).
Now, even for the dictionary encoded file of 126 MB, it's still strange that
reading it needs a peak memory usage of 1 GB (which I can reproduce, also when
measuring with memray).
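One way to get a rough idea from within Python, besides memray, is to look at the peak of Arrow's own memory pool (note this only covers allocations made through Arrow's pool, not the total process memory that memray reports):
```python
import pyarrow as pa
import pyarrow.parquet as pq

pool = pa.default_memory_pool()
table = pq.read_table("myfile.parquet")
# max_memory() gives the peak number of bytes allocated from this pool
print(pool.max_memory() / 1024**2, "MB peak in Arrow's memory pool")
```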
If we naively count the different parts that are needed, you get: loading the
bytes of the file itself (126 MB), uncompressing it (201 MB),
deserializing/decoding it, and creating the actual Arrow memory (170 MB):
```python
>>> import pyarrow.parquet as pq
>>> meta = pq.read_metadata("myfile.parquet")
>>> meta.row_group(0).total_byte_size / 1024**2
201.57076263427734
>>> table = pq.read_table("myfile.parquet")
>>> table.nbytes / 1024**2
169.7307472229004
```
Naively summing those gives around 500 MB (126 + 201 + 170), so from that
point of view it is strange that decoding the dictionary-encoded Parquet data
requires that much additional memory on top of it.
A further note: the files without dictionary encoding and with
DELTA_BINARY_PACKED encoding also require less peak memory to read into an
Arrow table: I see a reported peak memory of 734 MB and 384 MB, respectively
(tested with the following files):
```python
pq.write_table(table, "test_myfile_plain.parquet", use_dictionary=False)
pq.write_table(table, "test_myfile_delta.parquet", use_dictionary=False,
column_encoding="DELTA_BINARY_PACKED")
```
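For the peak-memory numbers above I used memray; as a quick sketch, the three variants can also be compared on disk size vs in-memory size like this:
```python
import os
import pyarrow.parquet as pq

for path in [
    "myfile.parquet",             # default write, dictionary encoded
    "test_myfile_plain.parquet",  # use_dictionary=False
    "test_myfile_delta.parquet",  # DELTA_BINARY_PACKED
]:
    t = pq.read_table(path)
    print(
        f"{path}: {os.path.getsize(path) / 1024**2:.0f} MB on disk, "
        f"{t.nbytes / 1024**2:.0f} MB in memory"
    )
```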