Additionally, I tested with my actual data, and did not see memory savings.
On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <[email protected]> wrote:

> Hi Joris,
>
> Thank you for the explanation. The 2x memory consumption on conversion
> makes sense if there is a copy, but it does seem like it persists longer
> than it should. Might that be because of Python's GC policies?
> I tried out your recommendations, but they did not seem to work. However, I
> did notice an experimental option on `to_pandas`, `self_destruct`, which
> seems to address the issue I'm facing. Sadly, that by itself did not work
> either... but, combined with `split_blocks=True`, I am seeing memory
> savings:
>
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> from pyarrow import feather
> import os
> import psutil
>
> pa.set_memory_pool(pa.system_memory_pool())
> DATA_FILE = 'test.arrow'
>
> def setup():
>     np.random.seed(0)
>     df = pd.DataFrame(np.random.randint(0, 100, size=(7196546, 57)),
>                       columns=[f'{i}' for i in range(57)])
>     df.to_feather(DATA_FILE)
>     print(f'wrote {DATA_FILE}')
>     import sys
>     sys.exit()
>
> if __name__ == "__main__":
>     # setup()
>     process = psutil.Process(os.getpid())
>     path = DATA_FILE
>
>     mem_size = process.memory_info().rss / 1e9
>     print(f'BEFORE mem_size: {mem_size}gb')
>
>     feather_table = feather.read_table(path)
>     # df = feather_table.to_pandas(split_blocks=True)
>     # df = feather_table.to_pandas()
>     df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
>
>     mem_size = process.memory_info().rss / 1e9
>     df_size = df.memory_usage().sum() / 1e9
>     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>     print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>
> OUTPUT (to_pandas()):
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
> ARROW: 3.281625024gb
>
> OUTPUT (to_pandas(split_blocks=True)):
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
> ARROW: 3.281627712gb
>
> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
> ARROW: 3.281627712gb
>
> I'm guessing that since this feature is experimental, it might either go
> away or might have strange behaviors. Is there anything I should look out
> for, or is there some alternative to reproduce these results?
>
> Thank You,
> Arun
>
> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche
> <[email protected]> wrote:
>
>> Hi Arun, Weston,
>>
>> I didn't try running the script locally, but a quick note: the
>> `feather.read_feather` function reads the Feather file into an Arrow
>> table and directly converts it to a pandas DataFrame. A memory
>> consumption of 2x the size of the dataframe does not sound that
>> unexpected to me: most of the time, when converting an Arrow table to a
>> pandas DataFrame, the data will be copied to accommodate pandas'
>> specific internal memory layout (at least numeric columns will be
>> combined together in 2D arrays).
>>
>> To verify whether this is the cause, you might want to do either of:
>> - use `feather.read_table` instead of `feather.read_feather`, which
>>   will read the file as an Arrow table instead (and not do any
>>   conversion to pandas)
>> - if you want to include the conversion to pandas, also use
>>   `read_table` and do the conversion to pandas explicitly with a
>>   `to_pandas()` call on the result. In that case, you can specify
>>   `split_blocks=True` to use more zero-copy conversion in the
>>   arrow->pandas conversion.
>>
>> Joris
>>
>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <[email protected]> wrote:
>> >
>> > Hi Wes,
>> >
>> > Sorry for the late reply on this, but I think I got a reproducible
>> > test case:
>> >
>> > import pandas as pd
>> > import numpy as np
>> > import pyarrow as pa
>> > from pyarrow import feather
>> > import os
>> > import psutil
>> >
>> > pa.set_memory_pool(pa.system_memory_pool())
>> > DATA_FILE = 'test.arrow'
>> >
>> > def setup():
>> >     np.random.seed(0)
>> >     df = pd.DataFrame(np.random.uniform(0, 100, size=(7196546, 57)),
>> >                       columns=[f'i_{i}' for i in range(57)])
>> >     df.to_feather(DATA_FILE)
>> >     print(f'wrote {DATA_FILE}')
>> >     import sys
>> >     sys.exit()
>> >
>> > if __name__ == "__main__":
>> >     # setup()
>> >     process = psutil.Process(os.getpid())
>> >     path = DATA_FILE
>> >
>> >     mem_size = process.memory_info().rss / 1e9
>> >     print(f'BEFORE mem_size: {mem_size}gb')
>> >
>> >     df = feather.read_feather(path)
>> >
>> >     mem_size = process.memory_info().rss / 1e9
>> >     df_size = df.memory_usage().sum() / 1e9
>> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >     print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >
>> > OUTPUT:
>> > BEFORE mem_size: 0.091795456gb
>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
>> > ARROW: 3.281625024gb
>> >
>> > Let me know if you're able to see similar results.
>> >
>> > Thanks,
>> > Arun
>> >
>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <[email protected]>
>> > wrote:
>> >>
>> >> I get more or less the same results as you for the provided setup
>> >> data (exact same numbers for arrow & df_size, and slightly different
>> >> for RSS, which is to be expected). The fact that the arrow size is
>> >> much lower than the dataframe size is not too surprising to me. If a
>> >> column can't be zero-copied, then its memory will disappear from the
>> >> arrow pool (I think). Plus, object columns will have overhead in
>> >> pandas that they do not have in Arrow.
>> >>
>> >> The df_size issue for me seems to be tied to string columns. I think
>> >> pandas is overestimating how much size is needed there (many of my
>> >> strings are similar, and I wonder if some kind of object sharing is
>> >> happening). But we can table this for another time.
>> >>
>> >> I tried writing my feather file with your parameters and it didn't
>> >> have much impact on any of the numbers.
>> >>
>> >> Since the arrow size for you is expected (nearly the same as the
>> >> df_size), I'm not sure what to investigate next. The memory does not
>> >> seem to be retained by Arrow. Is there any chance you could create a
>> >> reproducible test case using randomly generated numpy data (then you
>> >> could share that setup function)?
>> >>
>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <[email protected]> wrote:
>> >> >
>> >> > Hi Wes,
>> >> >
>> >> > I'm not including the setup() call when I encounter the issue. I
>> >> > just kept it in there for ease of reproducibility. Memory usage is
>> >> > indeed higher when it is included, but that isn't surprising.
>> >> >
>> >> > I tried switching over to the system allocator, but there is no
>> >> > change.
>> >> >
>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
>> >> >
>> >> > I updated my script to also include the Arrow bytes allocated, and
>> >> > it gave me the following:
>> >> >
>> >> > MVE:
>> >> > import pandas as pd
>> >> > import pyarrow as pa
>> >> > from pyarrow import feather
>> >> > import os
>> >> > import psutil
>> >> >
>> >> > pa.set_memory_pool(pa.system_memory_pool())
>> >> >
>> >> > def setup():
>> >> >     df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >> >     df.to_feather('test.csv')
>> >> >
>> >> > if __name__ == "__main__":
>> >> >     # setup()
>> >> >     process = psutil.Process(os.getpid())
>> >> >     path = 'test.csv'
>> >> >
>> >> >     mem_size = process.memory_info().rss / 1e9
>> >> >     print(f'BEFORE mem_size: {mem_size}gb')
>> >> >
>> >> >     df = feather.read_feather(path)
>> >> >
>> >> >     df_size = df.memory_usage(deep=True).sum() / 1e9
>> >> >     mem_size = process.memory_info().rss / 1e10
>> >> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >> >     print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >> >
>> >> > Output with my data:
>> >> > BEFORE mem_size: 0.08761344gb
>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>> >> > ARROW: 3.080121792gb
>> >> >
>> >> > Output with Provided Setup Data:
>> >> > BEFORE mem_size: 0.09179136gb
>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>> >> > ARROW: 0.00029664gb
>> >> >
>> >> > I'm assuming that the df and the arrow bytes allocated/sizes are
>> >> > distinct and non-overlapping, but it seems strange that the output
>> >> > with the provided data has the Arrow bytes allocated at ~0GB,
>> >> > whereas the one with my data has the allocated data approximately
>> >> > equal to the dataframe size.
>> >> >
>> >> > I'm not sure if it affects anything, but my file was written with
>> >> > the following:
>> >> >
>> >> > import pyarrow.lib as ext
>> >> > import pyarrow
>> >> > COMPRESSION_LEVEL = 19
>> >> > COMPRESSION_ALGO = 'zstd'
>> >> > KILOBYTE = 1 << 10
>> >> > MEGABYTE = KILOBYTE * KILOBYTE
>> >> > CHUNK_SIZE = MEGABYTE
>> >> >
>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
>> >> > ext.write_feather(table, dest, compression=compression,
>> >> >                   compression_level=compression_level,
>> >> >                   chunksize=chunk_size, version=2)
>> >> >
>> >> > As to the discrepancy around calculating dataframe size: I'm not
>> >> > sure why that would be so off for you. Going off the docs, it seems
>> >> > like it should be accurate. My DataFrame in question is
>> >> > [7196546 rows x 56 columns], where each column is mostly a float or
>> >> > integer, plus a datetime index. 7196546 * 56 * 8 = 3224052608
>> >> > ~= 3.2GB, which roughly aligns.
>> >> >
>> >> > Thank You,
>> >> > Arun
>> >> >
>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> A 2x overshoot of memory does seem a little high. Are you
>> >> >> including the "setup" part when you encounter that? Arrow's
>> >> >> file-based CSV reader will require 2-3x memory usage because it
>> >> >> buffers the bytes in memory in case it needs to re-convert them
>> >> >> later (because it realizes the data type for the column is
>> >> >> different). I'm not sure if pandas' CSV reader is similar.
>> >> >>
>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold
>> >> >> on to a bit more memory and hold onto it (for a little while at
>> >> >> least) even after it is no longer used. Even malloc will hold onto
>> >> >> memory sometimes due to fragmentation or other concerns. You could
>> >> >> try changing to the system allocator
>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your
>> >> >> file) to see if that makes a difference.
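[Editor's note: Arun's back-of-the-envelope estimate for the [7196546 rows x 56 columns] frame above can be checked directly. This is a sketch of the same arithmetic, with the added assumption (consistent with the reported `df_size`) that the datetime index contributes one more 8-byte column:]

```python
# Rough size estimate for a frame of 7196546 rows x 56 eight-byte
# (float64/int64) columns plus a datetime64[ns] index (8 bytes/row).
rows, cols, bytes_per_value = 7196546, 56, 8

data_bytes = rows * cols * bytes_per_value   # 3224052608, Arun's ~3.2GB figure
index_bytes = rows * 8                       # assumed datetime64[ns] index
total_gb = (data_bytes + index_bytes) / 1e9

print(f'estimated df_size: {total_gb:.9f}gb')
```

Including the index, the estimate comes to 3.281624976gb, which is within a few hundred bytes of the `df_size: 3.281625104gb` reported throughout the thread (`DataFrame.memory_usage` adds a small amount of bookkeeping overhead).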
>> >> >>
>> >> >> I'm not sure your method of calculating the dataframe size is
>> >> >> reliable. I don't actually know enough about pandas, but when I
>> >> >> tried your experiment with my own 1.9GB CSV file it ended up
>> >> >> reporting:
>> >> >>
>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>> >> >>
>> >> >> which seems suspicious.
>> >> >>
>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't
>> >> >> seem all that unexpected. There was 2.348GB of usage. Arrow itself
>> >> >> was only using ~1.9GB, and I will naively assume the difference
>> >> >> between the two is bloat caused by object wrappers when converting
>> >> >> to pandas.
>> >> >>
>> >> >> Another thing you might try and measure is
>> >> >> `pa.default_memory_pool().bytes_allocated()`. This will tell you
>> >> >> how much memory Arrow itself is hanging onto. If that is not 6GB,
>> >> >> then it is a pretty good guess that memory is being held somewhere
>> >> >> else.
>> >> >>
>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <[email protected]>
>> >> >> wrote:
>> >> >> >
>> >> >> > Hi Apache Arrow Members,
>> >> >> >
>> >> >> > My question is below, but I've compiled a minimum reproducible
>> >> >> > example with a public dataset:
>> >> >> >
>> >> >> > import pandas as pd
>> >> >> > from pyarrow import feather
>> >> >> > import os
>> >> >> > import psutil
>> >> >> >
>> >> >> > def setup():
>> >> >> >     df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >> >> >     df.to_feather('test.csv')
>> >> >> >
>> >> >> > if __name__ == "__main__":
>> >> >> >     # setup()
>> >> >> >     process = psutil.Process(os.getpid())
>> >> >> >     path = 'test.csv'
>> >> >> >
>> >> >> >     mem_size = process.memory_info().rss / 1e9
>> >> >> >     print(f'BEFORE mem_size: {mem_size}gb')
>> >> >> >
>> >> >> >     df = feather.read_feather(path)
>> >> >> >
>> >> >> >     df_size = df.memory_usage(deep=True).sum() / 1e9
>> >> >> >     mem_size = process.memory_info().rss / 1e9
>> >> >> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >> >> >
>> >> >> > I substituted my df with a sample CSV. I had trouble finding a
>> >> >> > sample CSV of adequate size; however, my dataset is ~3GB, and I
>> >> >> > see memory usage of close to 6GB.
>> >> >> >
>> >> >> > Output with My Data:
>> >> >> > BEFORE mem_size: 0.088891392gb
>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>> >> >> >
>> >> >> > It seems strange that the overall memory usage of the process is
>> >> >> > approximately double the size of the dataframe itself. Is there
>> >> >> > a reason for this, and is there a way to mitigate it?
>> >> >> >
>> >> >> > $ conda list pyarrow
>> >> >> > #
>> >> >> > # Name    Version    Build                  Channel
>> >> >> > pyarrow   4.0.1      py37h0f64622_13_cpu    conda-forge
>> >> >> >
>> >> >> > Thank You,
>> >> >> > Arun Joseph

--
Arun Joseph
