Hi Apache Arrow Members,
My question is below, but first I've put together a minimal reproducible example with
a public dataset:
import pandas as pd
from pyarrow import feather
import os
import psutil

def setup():
    # Download a public dataset and save it as a Feather file
    df = pd.read_csv(
        'https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv'
    )
    df.to_feather('test.feather')

if __name__ == "__main__":
    # setup()
    process = psutil.Process(os.getpid())
    path = 'test.feather'

    # Process memory before reading the Feather file
    mem_size = process.memory_info().rss / 1e9
    print(f'BEFORE mem_size: {mem_size}gb')

    df = feather.read_feather(path)

    # Process memory and DataFrame size after reading
    df_size = df.memory_usage(deep=True).sum() / 1e9
    mem_size = process.memory_info().rss / 1e9
    print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
I substituted my DataFrame with a sample CSV, though I had trouble finding a
sample CSV of adequate size. My actual dataset is ~3 GB, and I see memory usage
of close to 6 GB.
Output with My Data:
BEFORE mem_size: 0.088891392gb
AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
It seems strange that the overall memory usage of the process is approximately
double the size of the DataFrame itself. Is there a reason for this, and is
there a way to mitigate it?
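For what it's worth, one mitigation I've been looking at (just a sketch based on
my reading of the pyarrow docs, so I may be misunderstanding how these options
interact) is to read the file into an Arrow Table first and convert it with the
options that are documented to reduce copies during the pandas conversion:

from pyarrow import feather

table = feather.read_table('test.feather')
# split_blocks/self_destruct are documented to lower peak memory during
# the conversion to pandas, at the cost of consuming the Table
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table

I'm not sure whether this would actually avoid the doubling I'm seeing, or
whether the extra memory is being held by the Arrow memory pool rather than by
pandas.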
$ conda list pyarrow
#
# Name Version Build Channel
pyarrow 4.0.1 py37h0f64622_13_cpu conda-forge
Thank You,
Arun Joseph