Hi Apache Arrow Members,
My question is below, but first I've put together a minimal reproducible example with
a public dataset:
import pandas as pd
from pyarrow import feather
import os
import psutil

def setup():
    # Download a public dataset and save it as a Feather file
    df = pd.read_csv(
        'https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv'
    )
    df.to_feather('test.feather')

if __name__ == "__main__":
    # setup()
    process = psutil.Process(os.getpid())
    path = 'test.feather'

    # Process memory before reading the Feather file
    mem_size = process.memory_info().rss / 1e9
    print(f'BEFORE mem_size: {mem_size}gb')

    df = feather.read_feather(path)

    # Process memory and DataFrame size after reading
    df_size = df.memory_usage(deep=True).sum() / 1e9
    mem_size = process.memory_info().rss / 1e9
    print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
I substituted my DataFrame with a sample CSV, though I had trouble finding a
sample CSV of adequate size. My actual dataset is ~3 GB, and I see memory usage
of close to 6 GB.
Output with My Data:
BEFORE mem_size: 0.088891392gb
AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
It seems strange that the overall memory usage of the process is approximately
double the size of the DataFrame itself. Is there a reason for this, and is
there a way to mitigate it?
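For what it's worth, one mitigation I've been looking at (just a sketch based on
my reading of the pyarrow docs, so I may be misunderstanding how these options
interact) is to read the file into an Arrow Table first and convert it with the
options that are documented to reduce copies during the pandas conversion:

from pyarrow import feather

table = feather.read_table('test.feather')
# split_blocks/self_destruct are documented to lower peak memory during
# the conversion to pandas, at the cost of consuming the Table
df = table.to_pandas(split_blocks=True, self_destruct=True)
del table

I'm not sure whether this would actually avoid the doubling I'm seeing, or
whether the extra memory is being held by the Arrow memory pool rather than by
pandas.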
$ conda list pyarrow
#
# Name Version Build Channel
pyarrow 4.0.1 py37h0f64622_13_cpu conda-forge
Thank You,
Arun Joseph