Hi Wes,
I'm not including the setup() call when I encounter the issue. I just kept
it in there for ease of reproducibility. Memory usage is indeed higher when
it is included, but that isn't surprising.
I tried switching over to the system allocator, but there was no change.
I've also updated to Arrow 6.0.1, again with no change.
I updated my script to also include the Arrow bytes allocated and it gave
me the following:
MVE:

import pandas as pd
import pyarrow as pa
from pyarrow import feather
import os
import psutil

pa.set_memory_pool(pa.system_memory_pool())

def setup():
    df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
    df.to_feather('test.csv')

if __name__ == "__main__":
    # setup()
    process = psutil.Process(os.getpid())
    path = 'test.csv'

    mem_size = process.memory_info().rss / 1e9
    print(f'BEFORE mem_size: {mem_size}gb')

    df = feather.read_feather(path)

    df_size = df.memory_usage(deep=True).sum() / 1e9
    mem_size = process.memory_info().rss / 1e9
    print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
    print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
Output with my data:
BEFORE mem_size: 0.08761344gb
AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
ARROW: 3.080121792gb
Output with Provided Setup Data:
BEFORE mem_size: 0.09179136gb
AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
ARROW: 0.00029664gb
I'm assuming that the df and the arrow bytes allocated/sizes are distinct
and non-overlapping, but it seems strange that the output with the provided
data has the Arrow bytes allocated at ~0GB whereas the one with my data has
the allocated data approximately equal to the dataframe size. I'm not sure
if it affects anything but my file was written with the following:
import pyarrow.lib as ext
import pyarrow

COMPRESSION_LEVEL = 19
COMPRESSION_ALGO = 'zstd'
KILOBYTE = 1 << 10
MEGABYTE = KILOBYTE * KILOBYTE
CHUNK_SIZE = MEGABYTE

table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
ext.write_feather(table, dest, compression=COMPRESSION_ALGO,
                  compression_level=COMPRESSION_LEVEL,
                  chunksize=CHUNK_SIZE, version=2)
As for the discrepancy in calculating the dataframe size, I'm not sure why
that would be so off for you. Going off the docs
<https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html>,
it seems like it should be accurate. My dataframe in question is [7196546
rows x 56 columns], where each column is mostly float or integer, with a
datetime index. 7196546 * 56 * 8 = 3224052608 ~= 3.2GB, which roughly aligns.
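(The back-of-envelope arithmetic above, spelled out:)

```python
rows, cols, bytes_per_value = 7196546, 56, 8  # 8 bytes per float64/int64
expected_bytes = rows * cols * bytes_per_value
print(expected_bytes)        # 3224052608
print(expected_bytes / 1e9)  # ~3.22 GB
```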
Thank You,
Arun
On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <[email protected]> wrote:
> 2x overshoot of memory does seem a little high. Are you including the
> "setup" part when you encounter that? Arrow's file-based CSV reader
> will require 2-3x memory usage because it buffers the bytes in memory
> in case it needs to re-convert them later (because it realizes the
> data type for the column is different). I'm not sure if Panda's CSV
> reader is similar.
>
> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on
> to a bit more memory and hold onto it (for a little while at least)
> even after it is no longer used. Even malloc will hold onto memory
> sometimes due to fragmentation or other concerns. You could try
> changing to the system allocator
> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file)
> to see if that makes a difference.
>
> I'm not sure your method of calculating the dataframe size is
> reliable. I don't actually know enough about pandas but when I tried
> your experiment with my own 1.9G CSV file it ended up reporting:
>
> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>
> which seems suspicious.
>
> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
> all that unexpected. There was 2.348GB of usage. Arrow itself was
> only using ~1.9GB and I will naively assume the difference between the
> two is bloat caused by object wrappers when converting to pandas.
>
> Another thing you might try and measure is
> `pa.default_memory_pool().bytes_allocated()`. This will tell you how
> much memory Arrow itself is hanging onto. If that is not 6GB then it
> is a pretty good guess that memory is being held somewhere else.
>
> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <[email protected]> wrote:
> >
> > Hi Apache Arrow Members,
> >
> > My question is below but I've compiled a minimum reproducible example
> with a public dataset:
> >
> > import pandas as pd
> > from pyarrow import feather
> > import os
> > import psutil
> >
> >
> > def setup():
> >     df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
> >     df.to_feather('test.csv')
> >
> > if __name__ == "__main__":
> >     # setup()
> >     process = psutil.Process(os.getpid())
> >     path = 'test.csv'
> >
> >     mem_size = process.memory_info().rss / 1e9
> >     print(f'BEFORE mem_size: {mem_size}gb')
> >
> >     df = feather.read_feather(path)
> >
> >     df_size = df.memory_usage(deep=True).sum() / 1e9
> >     mem_size = process.memory_info().rss / 1e9
> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
> >
> > I substituted my df with a sample csv. I had trouble finding a sample
> CSV of adequate size however, my dataset is ~3GB, and I see memory usage of
> close to 6GB.
> >
> > Output with My Data:
> > BEFORE mem_size: 0.088891392gb
> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
> >
> > It seems strange that the overall memory usage of the process is approx
> double of the size of the dataframe itself. Is there a reason for this, and
> is there a way to mitigate this?
> >
> > $ conda list pyarrow
> > #
> > # Name Version Build Channel
> > pyarrow                   4.0.1          py37h0f64622_13_cpu    conda-forge
> >
> > Thank You,
> > Arun Joseph
> >
>
--
Arun Joseph