2x overshoot of memory does seem a little high.  Are you including the
"setup" part when you encounter that?  Arrow's file-based CSV reader
can require 2-3x memory usage because it buffers the raw bytes in
memory in case it needs to re-convert them later (e.g. if it realizes
partway through that a column's data type is different than it first
inferred).  I'm not sure whether pandas' CSV reader behaves similarly.

Dynamic memory allocators (e.g. jemalloc) can also cause Arrow to
hold onto a bit more memory, at least for a little while, even after
it is no longer used.  Even plain malloc will hold onto memory
sometimes due to fragmentation or other concerns.  You could try
switching to the system allocator
(pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
to see if that makes a difference.
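
Concretely, that experiment would look something like this (a rough
sketch, untested against your data):

    import pyarrow as pa

    # Use the system allocator (plain malloc/free) instead of the
    # default jemalloc/mimalloc pool.  This needs to run before any
    # Arrow allocations happen.
    pa.set_memory_pool(pa.system_memory_pool())

If you stay on jemalloc, pa.jemalloc_set_decay_ms(0) is another knob
worth trying; it should make jemalloc return unused memory to the OS
more eagerly.  Whether either helps will depend on your workload.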

I'm not sure your method of calculating the dataframe size is
reliable.  I don't know enough about pandas to say, but when I tried
your experiment with my own 1.9GB CSV file it ended up reporting:

AFTER mem_size: 2.348068864gb df_size: 4.519898461gb

which seems suspicious.

Anyway, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
all that unexpected.  There was 2.348GB of total usage; Arrow itself
was only using ~1.9GB, and I will naively assume the difference
between the two is overhead from the object wrappers created when
converting to pandas.

Another thing you might try measuring is
`pa.default_memory_pool().bytes_allocated()`.  This will tell you how
much memory Arrow itself is hanging onto.  If that is not ~6GB then it
is a pretty good guess that the memory is being held somewhere else.
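
For example, something like this (a sketch adapted from your script)
would show both numbers side by side:

    import os
    import psutil
    import pyarrow as pa
    from pyarrow import feather

    process = psutil.Process(os.getpid())
    df = feather.read_feather('test.csv')

    # Bytes currently held by Arrow's default memory pool vs. the
    # resident set size of the whole process.
    arrow_gb = pa.default_memory_pool().bytes_allocated() / 1e9
    rss_gb = process.memory_info().rss / 1e9
    print(f'arrow pool: {arrow_gb}gb rss: {rss_gb}gb')

If the Arrow number is well under 6GB, the rest of the RSS is
presumably being held by pandas/Python objects or by the allocator,
not by Arrow buffers.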

On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <[email protected]> wrote:
>
> Hi Apache Arrow Members,
>
> My question is below but I've compiled a minimum reproducible example with a 
> public dataset:
>
> import pandas as pd
> from pyarrow import feather
> import os
> import psutil
>
>
> def setup():
>   df = pd.read_csv(
>       'https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv'
>   )
>   df.to_feather('test.csv')
>
> if __name__ == "__main__":
>   # setup()
>   process = psutil.Process(os.getpid())
>   path = 'test.csv'
>
>   mem_size = process.memory_info().rss / 1e9
>   print(f'BEFORE mem_size: {mem_size}gb')
>
>   df = feather.read_feather(path)
>
>   df_size = df.memory_usage(deep=True).sum() / 1e9
>   mem_size = process.memory_info().rss / 1e9
>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>
> I substituted my df with a sample csv. I had trouble finding a sample CSV of 
> adequate size; however, my dataset is ~3GB, and I see memory usage of close to 
> 6GB.
>
> Output with My Data:
> BEFORE mem_size: 0.088891392gb
> AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>
> It seems strange that the overall memory usage of the process is approximately 
> double the size of the dataframe itself. Is there a reason for this, and 
> is there a way to mitigate this?
>
> $ conda list pyarrow
> #
> # Name                    Version                   Build  Channel
> pyarrow                   4.0.1           py37h0f64622_13_cpu    conda-forge
>
> Thank You,
> Arun Joseph
>
