> Thank you Wes and David for the in-depth responses.

Just as an aside, I go by Weston, as there is already a Wes on this
mailing list and it can get confusing ;)

> I also created a stack overflow post...I hope that is ok and/or useful. 
> Otherwise I can remove it.

I think that's fine; SO can have greater reach than the mailing list.

> As for guppy double counting, that is really strange.

I agree. Maybe guppy has difficulty identifying when two Python
objects reference the same underlying chunk of C memory.  Here's a
quick example of it getting confused (I'll add this to the SO
post):

import numpy as np
import os
import psutil
import pyarrow as pa
from guppy import hpy

process = psutil.Process(os.getpid())

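# Allocate ~800 MB of float64 values, then record guppy's view of
# the heap and the process RSS as a baseline.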
x = np.random.rand(100000000)
print(hpy().heap())
print(process.memory_info().rss)

# This is a zero-copy operation.  Note
# that RSS stays essentially unchanged.  Both x
# and arr reference the same underlying
# array of doubles.
arr = pa.array(x)
print(hpy().heap())
print(process.memory_info().rss)
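
If guppy attributes that buffer to both x and arr, the sizes in its
report will add up to more than the process actually uses, even
though pa.array() didn't copy anything.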

> By deleting the Reader, do you mean just doing a `del Reader` or `Reader = None`?

I was thinking "del reader", but "reader = None" should achieve the
same effect (I think?).
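
In case it's useful, here's a rough (untested) sketch of how I'd
combine the workarounds from this thread: drop the reader as soon as
the table is read, then convert with split_blocks/self_destruct and
use_threads=False.  The 'test.arrow' path is just a placeholder.

import pyarrow as pa

# Placeholder path; substitute your own Arrow IPC file.
path = 'test.arrow'

reader = pa.ipc.open_file(path)
table = reader.read_all()
# Work around the behavior tracked in ARROW-15017: drop the reader so
# it no longer pins the file's buffers ("reader = None" should behave
# the same way).
del reader

# self_destruct frees each Arrow column as it is converted, so the
# table is unusable afterwards; use_threads=False should keep the
# peak overhead to roughly one column's worth.
df = table.to_pandas(split_blocks=True, self_destruct=True,
                     use_threads=False)
del table

print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')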

On Tue, Dec 7, 2021 at 2:44 PM Arun Joseph <[email protected]> wrote:
>
> Thank you Wes and David for the in-depth responses. I also created a stack 
> overflow post on this with an example/outputs (before I saw your responses, 
> but updated after I saw it), I hope that is ok and/or useful. Otherwise I can 
> remove it.
>
> # I'm pretty sure guppy3 is double-counting.
> As for guppy double counting, that is really strange. I think you're looking 
> at the cumulative size re: the dataframe size. Although, I did observe what 
> you are describing when I was generating the test data as well. In the stack 
> overflow post, I have another example which prints out the RSS and the guppy 
> heap output and it does not seem like there is double counting for the normal 
> run. I also included a sleep at the end before recording the heap and RSS as 
> well.
>
> # I think split_blocks and self_destruct is the best answer at the moment.  
> self_destruct has remained in the code since at least 1.0.0 so perhaps it is 
> time we remove the "experimental" flag and maybe replace it with a "caution" 
> or "danger" flag (as it causes the table to become unusable afterwards). In 
> terms of the closest immediate fix, split_blocks and self_destruct do seem 
> like the best choices.
> Yes I agree. I'll be incorporating these changes in my actual codebase. While 
> they don't always work, there should be some improvement.
>
> # I did see some strange behavior when working with the RecordBatchFileReader 
> and I opened ARROW-15017 to resolve this but you can work around this by 
> deleting the reader.
> By deleting the Reader, do you mean just doing a `del Reader` or `Reader = None`?
>
> # Note that to minimize the memory usage, you should also pass 
> use_threads=False
> I will also try this out, thank you
>
> On Tue, Dec 7, 2021 at 6:32 PM David Li <[email protected]> wrote:
>>
>> Just for edification (though I have limited understanding of the machinery 
>> here, someone more familiar with Pandas internals may have more insight/this 
>> may be wrong or very outdated!):
>>
>> zero_copy_only does not work for two reasons (well, one reason 
>> fundamentally): the representation in memory of a Pandas dataframe has been 
>> a dense, 2D NumPy array per column type. In other words, all data across all 
>> columns of the same type are contiguous in memory. (At least historically. 
>> My understanding is that this has changed/become more flexible relatively 
>> recently.) This is the representation that Arrow tries to generate by 
>> default. (See https://uwekorn.com/2020/05/24/the-one-pandas-internal.html.)
>>
>> However, the Arrow table you have is not contiguous: each column is 
>> allocated separately, and for a Table, each column is made up of a list of 
>> contiguous chunks. So there are very few cases where data can be 
>> zero-copied, it must instead be copied and "compacted".
>>
>> The split_blocks option *helps* work around this. It allows each column in 
>> the Pandas DataFrame to be its own allocation. However, each individual 
>> column must still be contiguous. If you try zero_copy_only with 
>> split_blocks, you'll get a different error message, this is because the 
>> columns of your Arrow Table have more than one chunk. If you create a small 
>> in-memory Table with only one column with one chunk, zero_copy_only + 
>> split_blocks will work!
>>
>> split_blocks with self_destruct works in this case still because 
>> self_destruct will still copy data, it will just also try to free the Arrow 
>> data as each column is converted. (Note that to minimize the memory usage, 
>> you should also pass use_threads=False. In that case, the maximum memory 
>> overhead should be one column's worth.)
>>
>> -David
>>
>> On Tue, Dec 7, 2021, at 18:09, Weston Pace wrote:
>>
>> Thank you for the new example.
>>
>> # Why is it 2x?
>>
>> This is essentially a "peak RAM" usage of the operation.  Given that
>> split_blocks helped I think we can attribute this doubling to the
>> pandas conversion.
>>
>> # Why doesn't the memory get returned?
>>
>> It does, it just doesn't do so immediately.  If I put a 5 second sleep
>> before I print the memory I see that the RSS shrinks down.  This is
>> how jemalloc is configured in Arrow (actually I think it is 1 second)
>> for releasing RSS after reaching peak consumption.
>>
>> BEFORE mem_size: 0.082276352gb
>> AFTER: mem_size: 6.68639232gb df_size: 3.281625104gb
>> AFTER-ARROW: 3.281625024gb
>> ---five second sleep---
>> AFTER-SLEEP: mem_size: 3.3795072gb df_size: 3.281625104gb
>> AFTER-SLEEP-ARROW: 3.281625024gb
>>
>> # Why didn't switching to the system allocator help?
>>
>> The problem isn't "the dynamic allocator is allocating more than it
>> needs".  There is a point in this process where ~6GB are actually
>> needed.  The system allocator either also holds on to that RSS for a
>> little bit or the RSS numbers themselves take a little bit of time to
>> update.  I'm not entirely sure.
>>
>> # Why isn't this a zero-copy conversion to pandas?
>>
>> That's a good question, I don't know the details.  If I try manually
>> doing the conversion with zero_copy_only I get the error "Cannot do
>> zero copy conversion into multi-column DataFrame block"
>>
>> # What is up with the numpy.ndarray objects in the heap?
>>
>> I'm pretty sure guppy3 is double-counting.  Note that the total size
>> is ~20GB.  I've been able to reproduce this in cases where the heap is
>> 3GB and guppy still shows the dataframe taking up 6GB.  In fact, I
>> once even managed to generate this:
>>
>> AFTER-SLEEP: mem_size: 3.435835392gb df_size: 3.339197344gb
>> AFTER-SLEEP-ARROW: 0.0gb
>> Partition of a set of 212560 objects. Total size = 13328742559 bytes.
>> Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>>      0     57   0 6563250864  49 6563250864  49 pandas.core.series.Series
>>      1    133   0 3339213718  25 9902464582  74 numpy.ndarray
>>      2      1   0 3339197360  25 13241661942  99 pandas.core.frame.DataFrame
>>
>> The RSS is 3.44GB but guppy reports the dataframe as 13GB.
>>
>> I did see some strange behavior when working with the
>> RecordBatchFileReader and I opened ARROW-15017 to resolve this but you
>> can work around this by deleting the reader.
>>
>> # Can I return the data immediately / I don't want to use 2x memory 
>> consumption
>>
>> I think split_blocks and self_destruct is the best answer at the
>> moment.  self_destruct has remained in the code since at least 1.0.0
>> so perhaps it is time we remove the "experimental" flag and maybe
>> replace it with a "caution" or "danger" flag (as it causes the table
>> to become unusable afterwards).
>>
>> Jemalloc has some manual facilities to purge dirty memory and we
>> expose some of them with
>> pyarrow.default_memory_pool().release_unused() but that doesn't seem
>> to be helping in this situation.  Either the excess memory is in the
>> non-jemalloc pool or the jemalloc command can't quite release this
>> memory, or the RSS stats are just stale.  I'm not entirely sure.
>>
>> On Tue, Dec 7, 2021 at 11:54 AM Arun Joseph <[email protected]> wrote:
>> >
>> > Slightly related, I have some other code that opens up an arrow file using 
>> > a `pyarrow.ipc.RecordBatchFileReader` and then converts RecordBatch to a 
>> > pandas dataframe. After this conversion is done, and I inspect the heap, I 
>> > always see the following:
>> >
>> > hpy().heap()
>> > Partition of a set of 351136 objects. Total size = 20112096840 bytes.
>> >  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>> >      0    121   0 9939601034  49 9939601034  49 numpy.ndarray
>> >      1      1   0 9939585700  49 19879186734  99 pandas.core.frame.DataFrame
>> >      2      1   0 185786680   1 20064973414 100 pandas.core.indexes.datetimes.DatetimeIndex
>> >
>> > Specifically the numpy.ndarray. It only shows up after the conversion and 
>> > it does not seem to go away. It also seems to be roughly the same size as 
>> > the dataframe itself.
>> >
>> > - Arun
>> >
>> > On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <[email protected]> wrote:
>> >>
>> >> Just to follow up on this, is there a way to manually force the arrow 
>> >> pool to de-allocate? My usecase is essentially having multiple processes 
>> >> in a Pool or via Slurm read from an arrow file, do some work, and then 
>> >> exit. Issue is that the 2x memory consumption reduces the bandwidth on 
>> >> the machine to effectively half.
>> >>
>> >> Thank You,
>> >> Arun
>> >>
>> >> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <[email protected]> wrote:
>> >>>
>> >>> Additionally, I tested with my actual data, and did not see memory 
>> >>> savings.
>> >>>
>> >>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <[email protected]> wrote:
>> >>>>
>> >>>> Hi Joris,
>> >>>>
>> >>>> Thank you for the explanation. The 2x memory consumption on conversion 
>> >>>> makes sense if there is a copy, but it does seem like it persists 
>> >>>> longer than it should. Might be because of python's GC policies?
>> >>>> I tried out your recommendations but they did not seem to work. 
>> >>>> However, I did notice an experimental option on `to_pandas`, 
>> >>>> `self_destruct`, which seems to address the issue I'm facing. Sadly, 
>> >>>> that itself did not work either... but, combined with 
>> >>>> `split_blocks=True`, I am seeing memory savings:
>> >>>>
>> >>>> import pandas as pd
>> >>>> import numpy as np
>> >>>> import pyarrow as pa
>> >>>> from pyarrow import feather
>> >>>> import os
>> >>>> import psutil
>> >>>> pa.set_memory_pool(pa.system_memory_pool())
>> >>>> DATA_FILE = 'test.arrow'
>> >>>>
>> >>>> def setup():
>> >>>>   np.random.seed(0)
>> >>>>   df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)), columns=list([f'{i}' for i in range(57)]))
>> >>>>   df.to_feather(DATA_FILE)
>> >>>>   print(f'wrote {DATA_FILE}')
>> >>>>   import sys
>> >>>>   sys.exit()
>> >>>>
>> >>>> if __name__ == "__main__":
>> >>>>   # setup()
>> >>>>   process = psutil.Process(os.getpid())
>> >>>>   path = DATA_FILE
>> >>>>
>> >>>>   mem_size = process.memory_info().rss / 1e9
>> >>>>   print(f'BEFORE mem_size: {mem_size}gb')
>> >>>>
>> >>>>   feather_table = feather.read_table(path)
>> >>>>   # df = feather_table.to_pandas(split_blocks=True)
>> >>>>   # df = feather_table.to_pandas()
>> >>>>   df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
>> >>>>
>> >>>>   mem_size = process.memory_info().rss / 1e9
>> >>>>   df_size = df.memory_usage().sum() / 1e9
>> >>>>   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >>>>   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >>>>
>> >>>>
>> >>>> OUTPUT(to_pandas()):
>> >>>> BEFORE mem_size: 0.091795456gb
>> >>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
>> >>>> ARROW: 3.281625024gb
>> >>>>
>> >>>> OUTPUT (to_pandas(split_blocks=True)):
>> >>>> BEFORE mem_size: 0.091795456gb
>> >>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
>> >>>> ARROW: 3.281627712gb
>> >>>>
>> >>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
>> >>>> BEFORE mem_size: 0.091795456gb
>> >>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
>> >>>> ARROW: 3.281627712gb
>> >>>>
>> >>>> I'm guessing since this feature is experimental, it might either go 
>> >>>> away, or might have strange behaviors. Is there anything I should look 
>> >>>> out for, or is there some alternative to reproduce these results?
>> >>>>
>> >>>> Thank You,
>> >>>> Arun
>> >>>>
>> >>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche 
>> >>>> <[email protected]> wrote:
>> >>>>>
>> >>>>> Hi Arun, Weston,
>> >>>>>
>> >>>>> I didn't try running the script locally, but a quick note: the
>> >>>>> `feather.read_feather` function reads the Feather file into an Arrow
>> >>>>> table ànd directly converts it to a pandas DataFrame. A memory
>> >>>>> consumption 2x the size of the dataframe sounds not that unexpected to
>> >>>>> me: most of the time, when converting an arrow table to a pandas
>> >>>>> DataFrame, the data will be copied to accommodate for pandas' specific
>> >>>>> internal memory layout (at least numeric columns will be combined
>> >>>>> together in 2D arrays).
>> >>>>>
>> >>>>> To verify if this is the cause, you might want to do either of:
>> >>>>> - use `feather.read_table` instead of `feather.read_feather`, which
>> >>>>> will read the file as an Arrow table instead (and don't do any
>> >>>>> conversion to pandas)
>> >>>>> - if you want to include the conversion to pandas, also use
>> >>>>> `read_table` and do the conversion to pandas explicitly with a
>> >>>>> `to_pandas()` call on the result. In that case, you can specify
>> >>>>> `split_blocks=True` to use more zero-copy conversion in the
>> >>>>> arrow->pandas conversion
>> >>>>>
>> >>>>> Joris
>> >>>>>
>> >>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <[email protected]> wrote:
>> >>>>> >
>> >>>>> > Hi Wes,
>> >>>>> >
>> >>>>> > Sorry for the late reply on this, but I think I got a reproducible 
>> >>>>> > test case:
>> >>>>> >
>> >>>>> > import pandas as pd
>> >>>>> > import numpy as np
>> >>>>> > import pyarrow as pa
>> >>>>> > from pyarrow import feather
>> >>>>> > import os
>> >>>>> > import psutil
>> >>>>> > pa.set_memory_pool(pa.system_memory_pool())
>> >>>>> > DATA_FILE = 'test.arrow'
>> >>>>> >
>> >>>>> > def setup():
>> >>>>> >   np.random.seed(0)
>> >>>>> >   df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)), columns=list([f'i_{i}' for i in range(57)]))
>> >>>>> >   df.to_feather(DATA_FILE)
>> >>>>> >   print(f'wrote {DATA_FILE}')
>> >>>>> >   import sys
>> >>>>> >   sys.exit()
>> >>>>> >
>> >>>>> > if __name__ == "__main__":
>> >>>>> >   # setup()
>> >>>>> >   process = psutil.Process(os.getpid())
>> >>>>> >   path = DATA_FILE
>> >>>>> >
>> >>>>> >   mem_size = process.memory_info().rss / 1e9
>> >>>>> >   print(f'BEFORE mem_size: {mem_size}gb')
>> >>>>> >
>> >>>>> >   df = feather.read_feather(path)
>> >>>>> >
>> >>>>> >   mem_size = process.memory_info().rss / 1e9
>> >>>>> >   df_size = df.memory_usage().sum() / 1e9
>> >>>>> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >>>>> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >>>>> >
>> >>>>> > OUTPUT:
>> >>>>> > BEFORE mem_size: 0.091795456gb
>> >>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
>> >>>>> > ARROW: 3.281625024gb
>> >>>>> >
>> >>>>> > Let me know if you're able to see similar results.
>> >>>>> >
>> >>>>> > Thanks,
>> >>>>> > Arun
>> >>>>> >
>> >>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <[email protected]> 
>> >>>>> > wrote:
>> >>>>> >>
>> >>>>> >> I get more or less the same results as you for the provided setup 
>> >>>>> >> data
>> >>>>> >> (exact same #'s for arrow & df_size and slightly different for RSS
>> >>>>> >> which is to be expected).  The fact that the arrow size is much 
>> >>>>> >> lower
>> >>>>> >> than the dataframe size is not too surprising to me.  If a column
>> >>>>> >> can't be zero-copied then its memory will disappear from the arrow
>> >>>>> >> pool (I think).  Plus, object columns will have overhead in pandas
>> >>>>> >> that they do not have in Arrow.
>> >>>>> >>
>> >>>>> >> The df_size issue for me seems to be tied to string columns.  I 
>> >>>>> >> think
>> >>>>> >> pandas is overestimating how much size is needed there (many of my
>> >>>>> >> strings are similar and I wonder if some kind of object sharing is
>> >>>>> >> happening).  But we can table this for another time.
>> >>>>> >>
>> >>>>> >> I tried writing my feather file with your parameters and it didn't
>> >>>>> >> have much impact on any of the numbers.
>> >>>>> >>
>> >>>>> >> Since the arrow size for you is expected (nearly the same as the
>> >>>>> >> df_size) I'm not sure what to investigate next.  The memory does not
>> >>>>> >> seem to be retained by Arrow.  Is there any chance you could create 
>> >>>>> >> a
>> >>>>> >> reproducible test case using randomly generated numpy data (then you
>> >>>>> >> could share that setup function)?
>> >>>>> >>
>> >>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <[email protected]> 
>> >>>>> >> wrote:
>> >>>>> >> >
>> >>>>> >> > Hi Wes,
>> >>>>> >> >
>> >>>>> >> > I'm not including the setup() call when I encounter the issue. I 
>> >>>>> >> > just kept it in there for ease of reproducibility. Memory usage 
>> >>>>> >> > is indeed higher when it is included, but that isn't surprising.
>> >>>>> >> >
>> >>>>> >> > I tried switching over to the system allocator but there is no 
>> >>>>> >> > change.
>> >>>>> >> >
>> >>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
>> >>>>> >> >
>> >>>>> >> > I updated my script to also include the Arrow bytes allocated and 
>> >>>>> >> > it gave me the following:
>> >>>>> >> >
>> >>>>> >> > MVE:
>> >>>>> >> > import pandas as pd
>> >>>>> >> > import pyarrow as pa
>> >>>>> >> > from pyarrow import feather
>> >>>>> >> > import os
>> >>>>> >> > import psutil
>> >>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
>> >>>>> >> >
>> >>>>> >> >
>> >>>>> >> > def setup():
>> >>>>> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >>>>> >> >   df.to_feather('test.csv')
>> >>>>> >> >
>> >>>>> >> > if __name__ == "__main__":
>> >>>>> >> >   # setup()
>> >>>>> >> >   process = psutil.Process(os.getpid())
>> >>>>> >> >   path = 'test.csv'
>> >>>>> >> >
>> >>>>> >> >   mem_size = process.memory_info().rss / 1e9
>> >>>>> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>> >>>>> >> >
>> >>>>> >> >   df = feather.read_feather(path)
>> >>>>> >> >
>> >>>>> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>> >>>>> >> >   mem_size = process.memory_info().rss / 1e10
>> >>>>> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >>>>> >> >   print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >>>>> >> >
>> >>>>> >> > Output with my data:
>> >>>>> >> > BEFORE mem_size: 0.08761344gb
>> >>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>> >>>>> >> > ARROW: 3.080121792gb
>> >>>>> >> >
>> >>>>> >> > Output with Provided Setup Data:
>> >>>>> >> > BEFORE mem_size: 0.09179136gb
>> >>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>> >>>>> >> > ARROW: 0.00029664gb
>> >>>>> >> >
>> >>>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes are 
>> >>>>> >> > distinct and non-overlapping, but it seems strange that the 
>> >>>>> >> > output with the provided data has the Arrow bytes allocated at 
>> >>>>> >> > ~0GB whereas the one with my data has the allocated data 
>> >>>>> >> > approximately equal to the dataframe size. I'm not sure if it 
>> >>>>> >> > affects anything but my file was written with the following:
>> >>>>> >> >
>> >>>>> >> > import pyarrow.lib as ext
>> >>>>> >> > import pyarrow
>> >>>>> >> > COMPRESSION_LEVEL = 19
>> >>>>> >> > COMPRESSION_ALGO = 'zstd'
>> >>>>> >> > KILOBYTE = 1 << 10
>> >>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
>> >>>>> >> > CHUNK_SIZE = MEGABYTE
>> >>>>> >> >
>> >>>>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
>> >>>>> >> > ext.write_feather(table, dest, compression=compression, compression_level=compression_level, chunksize=chunk_size, version=2)
>> >>>>> >> >
>> >>>>> >> > As to the discrepancy around calculating dataframe size. I'm not 
>> >>>>> >> > sure why that would be so off for you. Going off the docs, it 
>> >>>>> >> > seems like it should be accurate. My Dataframe in question is 
>> >>>>> >> > [7196546 rows x 56 columns] where each column is mostly a float 
>> >>>>> >> > or integer and datetime index. 7196546 * 56 * 8 = 3224052608 ~= 
>> >>>>> >> > 3.2GB which roughly aligns.
>> >>>>> >> >
>> >>>>> >> > Thank You,
>> >>>>> >> > Arun
>> >>>>> >> >
>> >>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace 
>> >>>>> >> > <[email protected]> wrote:
>> >>>>> >> >>
>> >>>>> >> >> 2x overshoot of memory does seem a little high.  Are you 
>> >>>>> >> >> including the
>> >>>>> >> >> "setup" part when you encounter that?  Arrow's file-based CSV 
>> >>>>> >> >> reader
>> >>>>> >> >> will require 2-3x memory usage because it buffers the bytes in 
>> >>>>> >> >> memory
>> >>>>> >> >> in case it needs to re-convert them later (because it realizes 
>> >>>>> >> >> the
>> >>>>> >> >> data type for the column is different).  I'm not sure if Panda's 
>> >>>>> >> >> CSV
>> >>>>> >> >> reader is similar.
>> >>>>> >> >>
>> >>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to 
>> >>>>> >> >> hold on
>> >>>>> >> >> to a bit more memory and hold onto it (for a little while at 
>> >>>>> >> >> least)
>> >>>>> >> >> even after it is no longer used.  Even malloc will hold onto 
>> >>>>> >> >> memory
>> >>>>> >> >> sometimes due to fragmentation or other concerns.  You could try
>> >>>>> >> >> changing to the system allocator
>> >>>>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your 
>> >>>>> >> >> file)
>> >>>>> >> >> to see if that makes a difference.
>> >>>>> >> >>
>> >>>>> >> >> I'm not sure your method of calculating the dataframe size is
>> >>>>> >> >> reliable.  I don't actually know enough about pandas but when I 
>> >>>>> >> >> tried
>> >>>>> >> >> your experiment with my own 1.9G CSV file it ended up reporting:
>> >>>>> >> >>
>> >>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>> >>>>> >> >>
>> >>>>> >> >> which seems suspicious.
>> >>>>> >> >>
>> >>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't 
>> >>>>> >> >> seem
>> >>>>> >> >> all that unexpected.  There was 2.348GB of usage.  Arrow itself 
>> >>>>> >> >> was
>> >>>>> >> >> only using ~1.9GB and I will naively assume the difference 
>> >>>>> >> >> between the
>> >>>>> >> >> two is bloat caused by object wrappers when converting to pandas.
>> >>>>> >> >>
>> >>>>> >> >> Another thing you might try and measure is
>> >>>>> >> >> `pa.default_memory_pool().bytes_allocated()`.  This will tell 
>> >>>>> >> >> you how
>> >>>>> >> >> much memory Arrow itself is hanging onto.  If that is not 6GB 
>> >>>>> >> >> then it
>> >>>>> >> >> is a pretty good guess that memory is being held somewhere else.
>> >>>>> >> >>
>> >>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <[email protected]> 
>> >>>>> >> >> wrote:
>> >>>>> >> >> >
>> >>>>> >> >> > Hi Apache Arrow Members,
>> >>>>> >> >> >
>> >>>>> >> >> > My question is below but I've compiled a minimum reproducible 
>> >>>>> >> >> > example with a public dataset:
>> >>>>> >> >> >
>> >>>>> >> >> > import pandas as pd
>> >>>>> >> >> > from pyarrow import feather
>> >>>>> >> >> > import os
>> >>>>> >> >> > import psutil
>> >>>>> >> >> >
>> >>>>> >> >> >
>> >>>>> >> >> > def setup():
>> >>>>> >> >> >   df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >>>>> >> >> >   df.to_feather('test.csv')
>> >>>>> >> >> >
>> >>>>> >> >> > if __name__ == "__main__":
>> >>>>> >> >> >   # setup()
>> >>>>> >> >> >   process = psutil.Process(os.getpid())
>> >>>>> >> >> >   path = 'test.csv'
>> >>>>> >> >> >
>> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>> >>>>> >> >> >   print(f'BEFORE mem_size: {mem_size}gb')
>> >>>>> >> >> >
>> >>>>> >> >> >   df = feather.read_feather(path)
>> >>>>> >> >> >
>> >>>>> >> >> >   df_size = df.memory_usage(deep=True).sum() / 1e9
>> >>>>> >> >> >   mem_size = process.memory_info().rss / 1e9
>> >>>>> >> >> >   print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >>>>> >> >> >
>> >>>>> >> >> > I substituted my df with a sample csv. I had trouble finding a 
>> >>>>> >> >> > sample CSV of adequate size however, my dataset is ~3GB, and I 
>> >>>>> >> >> > see memory usage of close to 6GB.
>> >>>>> >> >> >
>> >>>>> >> >> > Output with My Data:
>> >>>>> >> >> > BEFORE mem_size: 0.088891392gb
>> >>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>> >>>>> >> >> >
>> >>>>> >> >> > It seems strange that the overall memory usage of the process 
>> >>>>> >> >> > is approx double of the size of the dataframe itself. Is there 
>> >>>>> >> >> > a reason for this, and is there a way to mitigate this?
>> >>>>> >> >> >
>> >>>>> >> >> > $ conda list pyarrow
>> >>>>> >> >> > #
>> >>>>> >> >> > # Name                    Version                   Build    Channel
>> >>>>> >> >> > pyarrow                   4.0.1           py37h0f64622_13_cpu    conda-forge
>> >>>>> >> >> >
>> >>>>> >> >> > Thank You,
>> >>>>> >> >> > Arun Joseph
>> >>>>> >> >> >
>> >>>>> >> >
>> >>>>> >> >
>> >>>>> >> >
>> >>>>> >> > --
>> >>>>> >> > Arun Joseph
>> >>>>> >> >
>> >>>>> >
>> >>>>> >
>> >>>>> >
>> >>>>> > --
>> >>>>> > Arun Joseph
>> >>>>> >
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Arun Joseph
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> Arun Joseph
>> >>>
>> >>
>> >>
>> >> --
>> >> Arun Joseph
>> >>
>> >
>> >
>> > --
>> > Arun Joseph
>> >
>>
>>
>
>
> --
> Arun Joseph
>
