Dear Arrow Developers,
I’m having memory issues with certain Parquet files. Parquet applies run-length
encoding to certain columns, which means a column of int64s can take up a
couple of thousand bytes on disk but expand to hundreds of megabytes when
loaded into a pyarrow.Table. I’m a little surprised there’s no option to keep
the underlying Arrow array as encoded data.
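The closest thing I’ve found is the read_dictionary option on
pyarrow.parquet.read_table, which keeps dictionary-encoded columns as Arrow
DictionaryArrays instead of fully decoding them, but as far as I can tell it
only supports byte-array (string/binary) columns, so it doesn’t cover the
int64 case. A sketch, assuming a file with a dictionary-encoded string column
named 's':

import pyarrow.parquet

# Keep the dictionary encoding from the file rather than decoding every
# value -- only supported for BYTE_ARRAY columns as far as I can tell,
# and there's no equivalent that preserves run lengths.
table = pyarrow.parquet.read_table('strings.parquet', read_dictionary=['s'])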
As an example, I’ve created a file (attached) which is just 1’000’000
repetitions of a single int64 value.
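For reference, the attachment can be reproduced with something along these
lines (the column name 'value' and the repeated value itself are arbitrary):

import numpy as np
import pyarrow
import pyarrow.parquet

# One int64 value repeated 1'000'000 times; dictionary/RLE encoding keeps
# the file tiny on disk no matter how many repetitions there are.
values = pyarrow.array(np.full(1_000_000, 42, dtype=np.int64))
table = pyarrow.table({'value': values})
pyarrow.parquet.write_table(table, 'rle_bomb.parquet')

The script below measures what happens when the file is read back: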
import os

import psutil
import pyarrow.parquet

suffixes = ['B', 'KB', 'MB', 'GB', 'TB', 'PB']
current_process = psutil.Process(os.getpid())


def human_memory_size(nbytes):
    # Render a byte count as a human-readable string.
    i = 0
    while nbytes >= 1024 and i < len(suffixes) - 1:
        nbytes /= 1024.
        i += 1
    f = ('%.2f' % nbytes).rstrip('0').rstrip('.')
    return '%s %s' % (f, suffixes[i])


def log_memory_usage(msg):
    print(msg, human_memory_size(current_process.memory_info().rss))


log_memory_usage('Initial Memory usage')
print('Size of parquet file',
      human_memory_size(os.stat('rle_bomb.parquet').st_size))

pf = pyarrow.parquet.ParquetFile('rle_bomb.parquet')
log_memory_usage('Loaded schema')

table = pf.read()
log_memory_usage('Loaded memory usage')
This will produce the following output:
Initial Memory usage 27.11 MB
Size of parquet file 3.62 KB
Loaded schema 27.71 MB
Loaded memory usage 997.9 MB
This poses a real problem when running code like this on servers, as there
doesn’t seem to be any way to prevent the memory explosion through the PyArrow
API. I’m at a bit of a loss as to how to control for this: there doesn’t
appear to be a method to iterate over the Parquet data in set chunks whose
size could be calculated accurately up front.
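The nearest workaround I can see is reading one row group at a time, but a
single row group can still decode to hundreds of megabytes, and I can’t see a
way to know its decoded size before reading it. A sketch of what I mean:

import pyarrow.parquet

pf = pyarrow.parquet.ParquetFile('rle_bomb.parquet')
# Reading row group by row group caps memory at one decoded row group,
# but gives no control over how large that decoded row group will be.
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i)
    # ... process chunk (a pyarrow.Table) ...
    del chunk

Newer PyArrow versions also have ParquetFile.iter_batches(batch_size=...), but
the batch size there is a row count, not a byte budget, so it doesn’t solve
the sizing problem either.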
All the best,
Rollo Konig-Brock