Hi guys,

I'm seeing some confusing memory usage numbers when using pyarrow to
read a Parquet file.

I wrote a simple script that shows how much memory is consumed after each
step. The results are in the table below:

                               row count   pa.total_allocated_bytes   memory usage (psutil)
without filters                5131100     177M                       323M
with field filter              57340       2041K                      323M
with column pruning            5131100     48M                        154M
with filter + column pruning   57340       567K                       204M

The weird part: the total memory usage when I apply both the field filter
and column pruning (204M) is *larger* than with column pruning alone
(154M), even though far fewer rows are materialized.

I don't understand how that can happen. Does anyone know the reason?

thanks.

env info:

platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.10
distro info: ('Ubuntu', '20.04', 'focal')
pyarrow: 6.0.1


script code:

import pyarrow as pa
import psutil
import os
import pyarrow.dataset as ds

pid = os.getpid()

def show_mem(action: str) -> None:
    # Resident set size (RSS) of this process, in MiB.
    mem = psutil.Process(pid).memory_info().rss >> 20
    print(f"******* memory usage after {action} **********")
    print(f"*                   {mem}M                    *")
    print("**********************************************")

dataset = ds.dataset("tmp/uber.parquet", format="parquet")
show_mem("read dataset")
projection = {
    "Dispatching_base_num": ds.field("Dispatching_base_num")
}
row_filter = ds.field("locationID") == 100  # avoid shadowing the builtin filter()
table = dataset.to_table(
    filter=row_filter,
    columns=projection,
)
print(f"table row number: {table.num_rows}")
print(f"total bytes: {pa.total_allocated_bytes() >> 10}K")
show_mem("dataset.to_table")
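For reference, the psutil figure can be cross-checked on Linux with a stdlib-only helper (`rss_mib` is just an illustrative name) that reads VmRSS from /proc/self/status:

```python
# Linux-only cross-check of psutil's RSS reading, via /proc/self/status.
# VmRSS includes pages the allocator caches after frees, which is one
# reason RSS can stay high while pa.total_allocated_bytes() is tiny.
def rss_mib() -> int:
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) >> 10  # value is in kB
    return 0

print(f"VmRSS: {rss_mib()} MiB")
```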
