Dear Arrow Developers,
I’m having memory issues with certain Parquet files. Parquet applies run-length
encoding to certain columns, which means a column of int64s can take up a
couple of thousand bytes on disk but expand to hundreds of megabytes when
loaded into a pyarrow.Table. I’m a little surprised there’s no option to keep
the underlying Arrow array as encoded data.
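The closest thing I’ve found is the read_dictionary option on
pyarrow.parquet.read_table, which keeps dictionary-encoded columns as Arrow
DictionaryArrays instead of fully decoding them, but as far as I can tell it
only supports byte-array (string/binary) columns, so it doesn’t cover the
int64 case. A sketch, assuming a file with a dictionary-encoded string column
named 's':

import pyarrow.parquet

# Keep the dictionary encoding from the file rather than decoding every
# value -- only supported for BYTE_ARRAY columns as far as I can tell,
# and there's no equivalent that preserves run lengths.
table = pyarrow.parquet.read_table('strings.parquet', read_dictionary=['s'])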
As an example, I’ve created a file (attached) which is just 1’000’000
repetitions of a single int64 value.
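For reference, the attachment can be reproduced with something along these
lines (the column name 'value' and the repeated value itself are arbitrary):

import numpy as np
import pyarrow
import pyarrow.parquet

# One int64 value repeated 1'000'000 times; dictionary/RLE encoding keeps
# the file tiny on disk no matter how many repetitions there are.
values = pyarrow.array(np.full(1_000_000, 42, dtype=np.int64))
table = pyarrow.table({'value': values})
pyarrow.parquet.write_table(table, 'rle_bomb.parquet')

The script below measures what happens when the file is read back: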
import os

import psutil
import pyarrow.parquet

suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
current_process = psutil.Process(os.getpid())


def human_memory_size(nbytes):
    # Render a byte count as a human-readable string.
    i = 0
    while nbytes >= 1024 and i < len(suffixes) - 1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])


def log_memory_usage(msg):
    print(msg, human_memory_size(current_process.memory_info().rss))


log_memory_usage('Initial Memory usage')
print('Size of parquet file',
      human_memory_size(os.stat('rle_bomb.parquet').st_size))

pf = pyarrow.parquet.ParquetFile('rle_bomb.parquet')
log_memory_usage('Loaded schema')

table = pf.read()
log_memory_usage('Loaded memory usage')
This will produce the following output:
Initial Memory usage 27.11 MB
Size of parquet file 3.62 KB
Loaded schema 27.71 MB
Loaded memory usage 997.9 MB
This poses a real problem when running code like this on servers, as there
doesn’t seem to be any way to prevent the memory explosion through the PyArrow
API. I’m at a bit of a loss as to how to control for this: there doesn’t
appear to be a method to iterate over the Parquet data in set chunks whose
size could be calculated accurately up front.
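The nearest workaround I can see is reading one row group at a time, but a
single row group can still decode to hundreds of megabytes, and I can’t see a
way to know its decoded size before reading it. A sketch of what I mean:

import pyarrow.parquet

pf = pyarrow.parquet.ParquetFile('rle_bomb.parquet')
# Reading row group by row group caps memory at one decoded row group,
# but gives no control over how large that decoded row group will be.
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i)
    # ... process chunk (a pyarrow.Table) ...
    del chunk

Newer PyArrow versions also have ParquetFile.iter_batches(batch_size=...), but
the batch size there is a row count, not a byte budget, so it doesn’t solve
the sizing problem either.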
All the best,
Rollo Konig-Brock