Slightly related: I have some other code that opens an arrow file using
a `pyarrow.ipc.RecordBatchFileReader` and then converts the record
batches to a pandas dataframe. After this conversion is done, when I
inspect the heap I always see the following:

hpy().heap()
Partition of a set of 351136 objects. Total size = 20112096840 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0    121   0 9939601034  49 9939601034  49 numpy.ndarray
     1      1   0 9939585700  49 19879186734  99 pandas.core.frame.DataFrame
     2      1   0 185786680   1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex

The numpy.ndarray entry is the one that concerns me. It only shows up
after the conversion, it does not seem to go away, and it is roughly the
same size as the dataframe itself.
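
For context, this is roughly the shape of that code (not the exact
version; the file name is a stand-in, and hpy comes from the guppy3
package):

import pyarrow as pa
from guppy import hpy

with pa.OSFile('data.arrow', 'rb') as f:   # stand-in path
    reader = pa.ipc.RecordBatchFileReader(f)
    table = reader.read_all()              # all record batches as one Table
df = table.to_pandas()                     # copies into 2D numpy blocks

print(hpy().heap())  # the large numpy.ndarray entries show up here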

- Arun

On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <[email protected]> wrote:

> Just to follow up on this: is there a way to manually force the arrow pool
> to de-allocate? My use case is essentially having multiple processes, in a
> Pool or via Slurm, read from an arrow file, do some work, and then exit.
> The issue is that the 2x memory consumption effectively cuts the machine's
> usable capacity in half.
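>
> The closest I've found so far is the pool's best-effort `release_unused()`
> plus the jemalloc decay setting (which only applies when the jemalloc pool
> is active); neither guarantees that RSS actually drops:
>
> import pyarrow as pa
>
> # ask jemalloc to return dirty pages to the OS immediately
> pa.jemalloc_set_decay_ms(0)
>
> # ... read the arrow file and do the work ...
>
> # best-effort request to hand unused pool memory back to the OS
> pa.default_memory_pool().release_unused()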
>
> Thank You,
> Arun
>
> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <[email protected]> wrote:
>
>> Additionally, I tested with my actual data, and did not see memory
>> savings.
>>
>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <[email protected]> wrote:
>>
>>> Hi Joris,
>>>
>>> Thank you for the explanation. The 2x memory consumption on conversion
>>> makes sense if there is a copy, but it does seem to persist longer than
>>> it should. Might that be because of Python's GC policies?
>>> I tried out your recommendations, but they did not seem to help. However,
>>> I did notice an experimental option on `to_pandas()`, `self_destruct`,
>>> which seems to address the issue I'm facing. Sadly, that by itself did
>>> not work either, but combined with `split_blocks=True` I am seeing
>>> memory savings:
>>>
>>> import pandas as pd
>>> import numpy as np
>>> import pyarrow as pa
>>> from pyarrow import feather
>>> import os
>>> import sys
>>> import psutil
>>>
>>> pa.set_memory_pool(pa.system_memory_pool())
>>> DATA_FILE = 'test.arrow'
>>>
>>> def setup():
>>>   np.random.seed(0)
>>>   df = pd.DataFrame(np.random.randint(0, 100, size=(7196546, 57)),
>>>                     columns=[f'{i}' for i in range(57)])
>>>   df.to_feather(DATA_FILE)
>>>   print(f'wrote {DATA_FILE}')
>>>   sys.exit()
>>>
>>> if __name__ == "__main__":
>>>   # setup()  # run once to generate DATA_FILE, then comment out
>>>   process = psutil.Process(os.getpid())
>>>   path = DATA_FILE
>>>
>>>   mem_size = process.memory_info().rss / 1e9
>>>   print(f'BEFORE mem_size: {mem_size}gb')
>>>
>>>   feather_table = feather.read_table(path)
>>>   # df = feather_table.to_pandas(split_blocks=True)
>>>   # df = feather_table.to_pandas()
>>>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
>>>
>>>   mem_size = process.memory_info().rss / 1e9
>>>   df_size = df.memory_usage().sum() / 1e9
>>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>>>
>>>
>>> OUTPUT (to_pandas()):
>>> BEFORE mem_size: 0.091795456gb
>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
>>> ARROW: 3.281625024gb
>>>
>>> OUTPUT (to_pandas(split_blocks=True)):
>>> BEFORE mem_size: 0.091795456gb
>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
>>> ARROW: 3.281627712gb
>>>
>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
>>> BEFORE mem_size: 0.091795456gb
>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
>>> ARROW: 3.281627712gb
>>>
>>> Since this feature is experimental, I'm guessing it might either go away
>>> or have strange behaviors. Is there anything I should look out for, or is
>>> there some alternative that reproduces these results?
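>>>
>>> For reference, a minimal version of the pattern I'm using now (the `del`
>>> follows the docs' suggestion, since a self_destruct conversion consumes
>>> the table's buffers and the Table must not be used afterwards):
>>>
>>> table = feather.read_table(DATA_FILE)
>>> df = table.to_pandas(self_destruct=True, split_blocks=True)
>>> del table  # the Table is no longer valid after a self_destruct conversion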
>>>
>>> Thank You,
>>> Arun
>>>
>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <
>>> [email protected]> wrote:
>>>
>>>> Hi Arun, Weston,
>>>>
>>>> I didn't try running the script locally, but a quick note: the
>>>> `feather.read_feather` function reads the Feather file into an Arrow
>>>> table and directly converts it to a pandas DataFrame. Memory
>>>> consumption of 2x the size of the dataframe is not that unexpected to
>>>> me: most of the time, when converting an arrow table to a pandas
>>>> DataFrame, the data will be copied to accommodate pandas' specific
>>>> internal memory layout (at least numeric columns will be combined
>>>> into 2D arrays).
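>>>>
>>>> You can see that consolidation via pandas' internal block manager
>>>> (`_mgr` is private API and may differ between pandas versions, so this
>>>> is only for illustration):
>>>>
>>>> import numpy as np
>>>> import pandas as pd
>>>>
>>>> df = pd.DataFrame({'a': np.zeros(3), 'b': np.zeros(3)})
>>>> # both float64 columns are backed by one consolidated (2, 3) block
>>>> print([b.values.shape for b in df._mgr.blocks])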
>>>>
>>>> To verify if this is the cause, you might want to do either of:
>>>> - use `feather.read_table` instead of `feather.read_feather`, which
>>>> will read the file as an Arrow table instead (and do no conversion
>>>> to pandas)
>>>> - if you want to include the conversion to pandas, also use
>>>> `read_table` and then do the conversion explicitly with a
>>>> `to_pandas()` call on the result. In that case, you can specify
>>>> `split_blocks=True` to get more zero-copy behavior in the
>>>> arrow->pandas conversion
>>>>
>>>> Joris
>>>>
>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <[email protected]> wrote:
>>>> >
>>>> > Hi Wes,
>>>> >
>>>> > Sorry for the late reply on this, but I think I have a reproducible
>>>> > test case:
>>>> >
>>>> > import pandas as pd
>>>> > import numpy as np
>>>> > import pyarrow as pa
>>>> > from pyarrow import feather
>>>> > import os
>>>> > import psutil
>>>> > pa.set_memory_pool(pa.system_memory_pool())
>>>> > DATA_FILE = 'test.arrow'
>>>> >
>>>> > def setup():
>>>> >   np.random.seed(0)
>>>> >   df = pd.DataFrame(np.random.uniform(0, 100, size=(7196546, 57)),
>>>> >                     columns=[f'i_{i}' for i in range(57)])
>>>> >   df.to_feather(DATA_FILE)
>>>> >   print(f'wrote {DATA_FILE}')
>>>> >   import sys
>>>> >   sys.exit()
>>>> >
>>>> > if __name__ == "__main__":
>>>> >   # setup()
>>>> >   process = psutil.Process(os.getpid())
>>>> >   path = DATA_FILE
>>>> >
>>>> >   mem_size = process.memory_info().rss / 1e9
>>>> >   print(f'BEFORE mem_size: {mem_size}gb')
>>>> >
>>>> >   df = feather.read_feather(path)
>>>> >
>>>> >   mem_size = process.memory_info().rss / 1e9
>>>> >   df_size = df.memory_usage().sum() / 1e9
>>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>>>> >
>>>> > OUTPUT:
>>>> > BEFORE mem_size: 0.091795456gb
>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
>>>> > ARROW: 3.281625024gb
>>>> >
>>>> > Let me know if you're able to see similar results.
>>>> >
>>>> > Thanks,
>>>> > Arun
>>>> >
>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <[email protected]> wrote:
>>>> >>
>>>> >> I get more or less the same results as you for the provided setup
>>>> >> data (exact same #'s for arrow & df_size, and slightly different for
>>>> >> RSS, which is to be expected).  The fact that the arrow size is much
>>>> >> lower than the dataframe size is not too surprising to me.  If a
>>>> >> column can't be zero-copied then its memory will disappear from the
>>>> >> arrow pool (I think).  Plus, object columns will have overhead in
>>>> >> pandas that they do not have in Arrow.
>>>> >>
>>>> >> The df_size issue for me seems to be tied to string columns.  I think
>>>> >> pandas is overestimating how much size is needed there (many of my
>>>> >> strings are similar and I wonder if some kind of object sharing is
>>>> >> happening).  But we can table this for another time.
>>>> >>
>>>> >> I tried writing my feather file with your parameters and it didn't
>>>> >> have much impact on any of the numbers.
>>>> >>
>>>> >> Since the arrow size for you is expected (nearly the same as the
>>>> >> df_size) I'm not sure what to investigate next.  The memory does not
>>>> >> seem to be retained by Arrow.  Is there any chance you could create a
>>>> >> reproducible test case using randomly generated numpy data (then you
>>>> >> could share that setup function)?
>>>> >>
>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <[email protected]> wrote:
>>>> >> >
>>>> >> > Hi Wes,
>>>> >> >
>>>> >> > I'm not including the setup() call when I encounter the issue; I
>>>> >> > just kept it in there for ease of reproducibility. Memory usage is
>>>> >> > indeed higher when it is included, but that isn't surprising.
>>>> >> >
>>>> >> > I tried switching over to the system allocator, but there is no
>>>> >> > change.
>>>> >> >
>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
>>>> >> >
>>>> >> > I updated my script to also include the Arrow bytes allocated, and
>>>> >> > it gave me the following:
>>>> >> >
>>>> >> > MVE:
>>>> >> > import pandas as pd
>>>> >> > import pyarrow as pa
>>>> >> > from pyarrow import feather
>>>> >> > import os
>>>> >> > import psutil
>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
>>>> >> >
>>>> >> >
>>>> >> > def setup():
>>>> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>>>> >> >   df.to_feather('test.csv')  # a Feather file, despite the .csv name
>>>> >> >
>>>> >> > if __name__ == "__main__":
>>>> >> >   # setup()
>>>> >> >   process = psutil.Process(os.getpid())
>>>> >> >   path = 'test.csv'
>>>> >> >
>>>> >> >   mem_size = process.memory_info().rss / 1e9
>>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>>>> >> >
>>>> >> >   df = feather.read_feather(path)
>>>> >> >
>>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>>>> >> >   mem_size = process.memory_info().rss / 1e9
>>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>>>> >> >
>>>> >> > Output with my data:
>>>> >> > BEFORE mem_size: 0.08761344gb
>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>>>> >> > ARROW: 3.080121792gb
>>>> >> >
>>>> >> > Output with Provided Setup Data:
>>>> >> > BEFORE mem_size: 0.09179136gb
>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>>>> >> > ARROW: 0.00029664gb
>>>> >> >
>>>> >> > I'm assuming that the dataframe and the arrow bytes allocated are
>>>> >> > distinct and non-overlapping, but it seems strange that the output
>>>> >> > with the provided data has the Arrow bytes allocated at ~0GB, whereas
>>>> >> > the one with my data has the allocation approximately equal to the
>>>> >> > dataframe size. I'm not sure if it affects anything, but my file was
>>>> >> > written with the following:
>>>> >> >
>>>> >> > from pyarrow import feather
>>>> >> > import pyarrow
>>>> >> >
>>>> >> > COMPRESSION_ALGO = 'zstd'
>>>> >> > COMPRESSION_LEVEL = 19
>>>> >> > KILOBYTE = 1 << 10
>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
>>>> >> > CHUNK_SIZE = MEGABYTE
>>>> >> >
>>>> >> > # df, dest, and preserve_index are defined elsewhere in my code
>>>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
>>>> >> > feather.write_feather(table, dest, compression=COMPRESSION_ALGO,
>>>> >> >                       compression_level=COMPRESSION_LEVEL,
>>>> >> >                       chunksize=CHUNK_SIZE, version=2)
>>>> >> >
>>>> >> > As to the discrepancy around calculating dataframe size: I'm not
>>>> >> > sure why that would be so off for you. Going off the docs, it seems
>>>> >> > like it should be accurate. My DataFrame in question is [7196546
>>>> >> > rows x 56 columns], where each column is mostly a float or integer,
>>>> >> > with a datetime index. 7196546 * 56 * 8 bytes = 3224052608 ~= 3.2GB,
>>>> >> > which roughly aligns.
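>>>> >> >
>>>> >> > (A quick sanity check of that arithmetic, assuming all 56 columns
>>>> >> > are 8-byte dtypes:)
>>>> >> >
>>>> >> > rows, cols, itemsize = 7196546, 56, 8
>>>> >> > print(rows * cols * itemsize / 1e9)  # 3.224052608, close to df_size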
>>>> >> >
>>>> >> > Thank You,
>>>> >> > Arun
>>>> >> >
>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <[email protected]> wrote:
>>>> >> >>
>>>> >> >> 2x overshoot of memory does seem a little high.  Are you including
>>>> >> >> the "setup" part when you encounter that?  Arrow's file-based CSV
>>>> >> >> reader will require 2-3x memory usage because it buffers the bytes
>>>> >> >> in memory in case it needs to re-convert them later (because it
>>>> >> >> realizes the data type for the column is different).  I'm not sure
>>>> >> >> if pandas' CSV reader is similar.
>>>> >> >>
>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold
>>>> >> >> onto a bit more memory (for a little while at least) even after it
>>>> >> >> is no longer used.  Even malloc will hold onto memory sometimes due
>>>> >> >> to fragmentation or other concerns.  You could try changing to the
>>>> >> >> system allocator (pa.set_memory_pool(pa.system_memory_pool()) at
>>>> >> >> the top of your file) to see if that makes a difference.
>>>> >> >>
>>>> >> >> I'm not sure your method of calculating the dataframe size is
>>>> >> >> reliable.  I don't actually know enough about pandas, but when I
>>>> >> >> tried your experiment with my own 1.9GB CSV file it ended up
>>>> >> >> reporting:
>>>> >> >>
>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>>>> >> >>
>>>> >> >> which seems suspicious.
>>>> >> >>
>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem
>>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself was
>>>> >> >> only using ~1.9GB, and I will naively assume the difference between
>>>> >> >> the two is bloat caused by object wrappers when converting to
>>>> >> >> pandas.
>>>> >> >>
>>>> >> >> Another thing you might try to measure is
>>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell you
>>>> >> >> how much memory Arrow itself is hanging onto.  If that is not 6GB
>>>> >> >> then it is a pretty good guess that the memory is being held
>>>> >> >> somewhere else.
>>>> >> >>
>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <[email protected]> wrote:
>>>> >> >> >
>>>> >> >> > Hi Apache Arrow Members,
>>>> >> >> >
>>>> >> >> > My question is below, but I've compiled a minimal reproducible
>>>> >> >> > example with a public dataset:
>>>> >> >> >
>>>> >> >> > import pandas as pd
>>>> >> >> > from pyarrow import feather
>>>> >> >> > import os
>>>> >> >> > import psutil
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > def setup():
>>>> >> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>>>> >> >> >   df.to_feather('test.csv')  # a Feather file, despite the .csv name
>>>> >> >> >
>>>> >> >> > if __name__ == "__main__":
>>>> >> >> >   # setup()
>>>> >> >> >   process = psutil.Process(os.getpid())
>>>> >> >> >   path = 'test.csv'
>>>> >> >> >
>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>>>> >> >> >
>>>> >> >> >   df = feather.read_feather(path)
>>>> >> >> >
>>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>>>> >> >> >
>>>> >> >> > I substituted my df with a sample csv. I had trouble finding a
>>>> >> >> > sample CSV of adequate size; however, my dataset is ~3GB, and I
>>>> >> >> > see memory usage of close to 6GB.
>>>> >> >> >
>>>> >> >> > Output with My Data:
>>>> >> >> > BEFORE mem_size: 0.088891392gb
>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>>>> >> >> >
>>>> >> >> > It seems strange that the overall memory usage of the process is
>>>> >> >> > approximately double the size of the dataframe itself. Is there a
>>>> >> >> > reason for this, and is there a way to mitigate it?
>>>> >> >> >
>>>> >> >> > $ conda list pyarrow
>>>> >> >> > #
>>>> >> >> > # Name       Version      Build                  Channel
>>>> >> >> > pyarrow     4.0.1        py37h0f64622_13_cpu    conda-forge
>>>> >> >> >
>>>> >> >> > Thank You,
>>>> >> >> > Arun Joseph
>>>> >> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Arun Joseph
>>>> >> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Arun Joseph
>>>> >
>>>>
>>>
>>>
>>> --
>>> Arun Joseph
>>>
>>>
>>
>> --
>> Arun Joseph
>>
>>
>
> --
> Arun Joseph
>
>

-- 
Arun Joseph
