jorisvandenbossche commented on issue #38245:
URL: https://github.com/apache/arrow/issues/38245#issuecomment-1766848931
I have to admit that it indeed looks like a lot of memory for what is being
read.
First, what is a bit confusing about this case is that it is a quite
sub-optimal Parquet file: it's 126 MB on disk, while being 170 MB in memory
when loaded into Arrow. In many cases the gap is much bigger, with the file on
disk considerably smaller than the in-memory table. In this case, though, 1) you
have many columns and not many rows (which gives some extra metadata overhead),
and 2) you have columns of consecutive integers, which get dictionary encoded,
but since the integers are all unique, this encoding doesn't actually help at
all (typically, with this kind of data and more rows, we stop dictionary
encoding at some point and fall back to plain encoding, but here the dictionary
is kept for every column, because you have a lot of columns with relatively few
rows).
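For reference, a rough sketch of how a file like this could be constructed (the row/column counts here are made up for illustration, not taken from the original reproducer):
```python
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical shape: many columns, relatively few rows, all-unique integers
n_rows, n_cols = 20_000, 1_000
table = pa.table(
    {f"col{i}": np.arange(i * n_rows, (i + 1) * n_rows) for i in range(n_cols)}
)

# Dictionary encoding is on by default, but with all-unique values the
# dictionary is essentially as large as the data itself
pq.write_table(table, "myfile.parquet")

print(os.path.getsize("myfile.parquet") / 1024**2)  # size on disk in MB
print(table.nbytes / 1024**2)                       # size in memory in MB
```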
If I disable dictionary encoding, the file size reduces to 94 MB
(`use_dictionary=False` in `pq.write_table`), and when additionally enabling
DELTA_BINARY_PACKED as @mapleFU mentioned above, it reduces to a small file of
3 MB (`use_dictionary=False, column_encoding="DELTA_BINARY_PACKED"`).
Now, even for the dictionary encoded file of 126 MB, it's still strange that
reading it needs a peak memory usage of 1 GB (which I can reproduce, also when
measuring with memray).
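One way to get a rough idea from within Python, besides memray, is to look at the peak of Arrow's own memory pool (note this only covers allocations made through Arrow's pool, not the total process memory that memray reports):
```python
import pyarrow as pa
import pyarrow.parquet as pq

pool = pa.default_memory_pool()
table = pq.read_table("myfile.parquet")
# max_memory() gives the peak number of bytes allocated from this pool
print(pool.max_memory() / 1024**2, "MB peak in Arrow's memory pool")
```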
If we naively count the different parts that are needed, you get: loading the
bytes of the file itself (126 MB), uncompressing it (201 MB),
deserializing/decoding it, and creating the actual Arrow memory (170 MB):
```python
>>> import pyarrow.parquet as pq
>>> meta = pq.read_metadata("myfile.parquet")
>>> meta.row_group(0).total_byte_size / 1024**2
201.57076263427734
>>> table = pq.read_table("myfile.parquet")
>>> table.nbytes / 1024**2
169.7307472229004
```
Naively summing those gives around 500 MB (126 + 201 + 170), so from that
point of view it is strange that decoding the dictionary-encoded Parquet data
requires that much additional memory on top of it.
A further note: the files without dictionary encoding and with
DELTA_BINARY_PACKED encoding also require less peak memory to read into an
Arrow table: I see a reported peak memory of 734 MB and 384 MB, respectively
(tested with the following files):
```python
pq.write_table(table, "test_myfile_plain.parquet", use_dictionary=False)
pq.write_table(table, "test_myfile_delta.parquet", use_dictionary=False,
column_encoding="DELTA_BINARY_PACKED")
```
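For the peak-memory numbers above I used memray; as a quick sketch, the three variants can also be compared on disk size vs in-memory size like this:
```python
import os
import pyarrow.parquet as pq

for path in [
    "myfile.parquet",             # default write, dictionary encoded
    "test_myfile_plain.parquet",  # use_dictionary=False
    "test_myfile_delta.parquet",  # DELTA_BINARY_PACKED
]:
    t = pq.read_table(path)
    print(
        f"{path}: {os.path.getsize(path) / 1024**2:.0f} MB on disk, "
        f"{t.nbytes / 1024**2:.0f} MB in memory"
    )
```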