> Thank you Wes and David for the in-depth responses.

Just as an aside, I go by Weston, as there is already a Wes on this mailing list and it can get confusing ;)
> I also created a stack overflow post...I hope that is ok and/or useful.
> Otherwise I can remove it.

I think that's fine, SO can have greater reach than the mailing list.

> As for guppy double counting, that is really strange.

I agree, maybe guppy has difficulty identifying when two Python objects reference the same underlying chunk of C memory. Here's a quick, simple example of it getting confused (I'll add this to the SO post):

import numpy as np
import os
import psutil
import pyarrow as pa
from guppy import hpy

process = psutil.Process(os.getpid())

x = np.random.rand(100000000)
print(hpy().heap())
print(process.memory_info().rss)

# This is a zero-copy operation.  Note
# that RSS remains consistent.  Both x
# and arr reference the same underlying
# array of doubles.
arr = pa.array(x)

print(hpy().heap())
print(process.memory_info().rss)

> By deleting the Reader, do you mean just doing a `del Reader` or `Reader = None`?

I was thinking "del reader" but "reader = None" should achieve the same effect (I think?)

On Tue, Dec 7, 2021 at 2:44 PM Arun Joseph <[email protected]> wrote:
>
> Thank you Wes and David for the in-depth responses. I also created a stack overflow post on this with an example/outputs (before I saw your responses, but updated after I saw it), I hope that is ok and/or useful. Otherwise I can remove it.
>
> # I'm pretty sure guppy3 is double-counting.
> As for guppy double counting, that is really strange. I think you're looking at the cumulative size re: the dataframe size. Although, I did observe what you are describing when I was generating the test data as well. In the stack overflow post, I have another example which prints out the RSS and the guppy heap output and it does not seem like there is double counting for the normal run. I also included a sleep at the end before recording the heap and RSS as well.
>
> # I think split_blocks and self_destruct is the best answer at the moment. self_destruct has remained in the code since at least 1.0.0 so perhaps it is time we remove the "experimental" flag and maybe replace it with a "caution" or "danger" flag (as it causes the table to become unusable afterwards). In terms of the closest immediate fix, split_blocks and self_destruct do seem like the best choices.
> Yes I agree. I'll be incorporating these changes in my actual codebase. While they don't always work, there should be some improvement.
>
> # I did see some strange behavior when working with the RecordBatchFileReader and I opened ARROW-15017 to resolve this but you can work around this by deleting the reader.
> By deleting the Reader, do you mean just doing a `del Reader` or `Reader = None`?
>
> # Note that to minimize the memory usage, you should also pass use_threads=False
> I will also try this out, thank you
>
> On Tue, Dec 7, 2021 at 6:32 PM David Li <[email protected]> wrote:
>>
>> Just for edification (though I have limited understanding of the machinery here, someone more familiar with Pandas internals may have more insight/this may be wrong or very outdated!):
>>
>> zero_copy_only does not work for two reasons (well, one reason fundamentally): the representation in memory of a Pandas dataframe has been a dense, 2D NumPy array per column type. In other words, all data across all columns of the same type are contiguous in memory. (At least historically. My understanding is that this has changed/become more flexible relatively recently.)
>> This is the representation that Arrow tries to generate by default. (See https://uwekorn.com/2020/05/24/the-one-pandas-internal.html.)
>>
>> However, the Arrow table you have is not contiguous: each column is allocated separately, and for a Table, each column is made up of a list of contiguous chunks. So there are very few cases where data can be zero-copied; it must instead be copied and "compacted".
>>
>> The split_blocks option *helps* work around this. It allows each column in the Pandas DataFrame to be its own allocation. However, each individual column must still be contiguous. If you try zero_copy_only with split_blocks, you'll get a different error message; this is because the columns of your Arrow Table have more than one chunk. If you create a small in-memory Table with only one column with one chunk, zero_copy_only + split_blocks will work!
>>
>> split_blocks with self_destruct still works in this case because self_destruct will still copy data, it will just also try to free the Arrow data as each column is converted. (Note that to minimize the memory usage, you should also pass use_threads=False. In that case, the maximum memory overhead should be one column's worth.)
>>
>> -David
>>
>> On Tue, Dec 7, 2021, at 18:09, Weston Pace wrote:
>>
>> Thank you for the new example.
>>
>> # Why is it 2x?
>>
>> This is essentially a "peak RAM" usage of the operation. Given that split_blocks helped I think we can attribute this doubling to the pandas conversion.
>>
>> # Why doesn't the memory get returned?
>>
>> It does, it just doesn't do so immediately. If I put a 5 second sleep before I print the memory I see that the RSS shrinks down. This is how jemalloc is configured in Arrow (actually I think it is 1 second) for releasing RSS after reaching peak consumption.
>>
>> BEFORE mem_size: 0.082276352gb
>> AFTER: mem_size: 6.68639232gb df_size: 3.281625104gb
>> AFTER-ARROW: 3.281625024gb
>> ---five second sleep---
>> AFTER-SLEEP: mem_size: 3.3795072gb df_size: 3.281625104gb
>> AFTER-SLEEP-ARROW: 3.281625024gb
>>
>> # Why didn't switching to the system allocator help?
>>
>> The problem isn't "the dynamic allocator is allocating more than it needs". There is a point in this process where ~6GB are actually needed. The system allocator either also holds on to that RSS for a little bit or the RSS numbers themselves take a little bit of time to update. I'm not entirely sure.
>>
>> # Why isn't this a zero-copy conversion to pandas?
>>
>> That's a good question, I don't know the details. If I try manually doing the conversion with zero_copy_only I get the error "Cannot do zero copy conversion into multi-column DataFrame block".
>>
>> # What is up with the numpy.ndarray objects in the heap?
>>
>> I'm pretty sure guppy3 is double-counting. Note that the total size is ~20GB. I've been able to reproduce this in cases where the heap is 3GB and guppy still shows the dataframe taking up 6GB. In fact, I once even managed to generate this:
>>
>> AFTER-SLEEP: mem_size: 3.435835392gb df_size: 3.339197344gb
>> AFTER-SLEEP-ARROW: 0.0gb
>> Partition of a set of 212560 objects. Total size = 13328742559 bytes.
>>  Index  Count  %        Size   %   Cumulative   %  Kind (class / dict of class)
>>      0     57  0  6563250864  49   6563250864  49  pandas.core.series.Series
>>      1    133  0  3339213718  25   9902464582  74  numpy.ndarray
>>      2      1  0  3339197360  25  13241661942  99  pandas.core.frame.DataFrame
>>
>> The RSS is 3.44GB but guppy reports the dataframe as 13GB.
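A minimal sketch (not taken from the quoted messages) of the zero-copy constraints described above: a single-column, single-chunk Table can be handed to pandas without a copy, while a chunked column cannot. It assumes a reasonably recent pyarrow; the exact exception type and message may differ between versions.

import pyarrow as pa

# One column, one chunk: with split_blocks=True this can be zero-copied.
single = pa.table({'x': pa.array([1.0, 2.0, 3.0])})
print(single.to_pandas(zero_copy_only=True, split_blocks=True))

# The same column split across two chunks is no longer contiguous, so the
# conversion refuses to proceed rather than silently copying.
chunked = pa.concat_tables([
    pa.table({'x': [1.0, 2.0]}),
    pa.table({'x': [3.0, 4.0]}),
])
try:
    chunked.to_pandas(zero_copy_only=True, split_blocks=True)
except pa.ArrowException as exc:
    print(f'zero-copy refused: {exc}')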
>>
>> I did see some strange behavior when working with the RecordBatchFileReader and I opened ARROW-15017 to resolve this, but you can work around this by deleting the reader.
>>
>> # Can I return the data immediately / I don't want to use 2x memory consumption
>>
>> I think split_blocks and self_destruct is the best answer at the moment. self_destruct has remained in the code since at least 1.0.0 so perhaps it is time we remove the "experimental" flag and maybe replace it with a "caution" or "danger" flag (as it causes the table to become unusable afterwards).
>>
>> Jemalloc has some manual facilities to purge dirty memory and we expose some of them with pyarrow.default_memory_pool().release_unused() but that doesn't seem to be helping in this situation. Either the excess memory is in the non-jemalloc pool, or the jemalloc command can't quite release this memory, or the RSS stats are just stale. I'm not entirely sure.
>>
>> On Tue, Dec 7, 2021 at 11:54 AM Arun Joseph <[email protected]> wrote:
>> >
>> > Slightly related, I have some other code that opens up an arrow file using a `pyarrow.ipc.RecordBatchFileReader` and then converts the RecordBatch to a pandas dataframe. After this conversion is done, and I inspect the heap, I always see the following:
>> >
>> > hpy().heap()
>> > Partition of a set of 351136 objects. Total size = 20112096840 bytes.
>> >  Index  Count  %        Size   %   Cumulative    %  Kind (class / dict of class)
>> >      0    121  0  9939601034  49   9939601034   49  numpy.ndarray
>> >      1      1  0  9939585700  49  19879186734   99  pandas.core.frame.DataFrame
>> >      2      1  0   185786680   1  20064973414  100  pandas.core.indexes.datetimes.DatetimeIndex
>> >
>> > Specifically the numpy.ndarray. It only shows up after the conversion and it does not seem to go away. It also seems to be roughly the same size as the dataframe itself.
>> >
>> > - Arun
>> >
>> > On Tue, Dec 7, 2021 at 10:21 AM Arun Joseph <[email protected]> wrote:
>> >>
>> >> Just to follow up on this, is there a way to manually force the arrow pool to de-allocate? My use case is essentially having multiple processes in a Pool or via Slurm read from an arrow file, do some work, and then exit. The issue is that the 2x memory consumption reduces the bandwidth on the machine to effectively half.
>> >>
>> >> Thank You,
>> >> Arun
>> >>
>> >> On Mon, Dec 6, 2021 at 10:38 AM Arun Joseph <[email protected]> wrote:
>> >>>
>> >>> Additionally, I tested with my actual data, and did not see memory savings.
>> >>>
>> >>> On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <[email protected]> wrote:
>> >>>>
>> >>>> Hi Joris,
>> >>>>
>> >>>> Thank you for the explanation. The 2x memory consumption on conversion makes sense if there is a copy, but it does seem like it persists longer than it should. Might be because of Python's GC policies?
>> >>>> I tried out your recommendations but they did not seem to work. However, I did notice an experimental option on `to_pandas`, `self_destruct`, which seems to address the issue I'm facing. Sadly, that itself did not work either...
>> >>>> but, combined with `split_blocks=True`, I am seeing memory savings:
>> >>>>
>> >>>> import pandas as pd
>> >>>> import numpy as np
>> >>>> import pyarrow as pa
>> >>>> from pyarrow import feather
>> >>>> import os
>> >>>> import psutil
>> >>>> pa.set_memory_pool(pa.system_memory_pool())
>> >>>> DATA_FILE = 'test.arrow'
>> >>>>
>> >>>> def setup():
>> >>>>     np.random.seed(0)
>> >>>>     df = pd.DataFrame(np.random.randint(0,100,size=(7196546, 57)), columns=list([f'{i}' for i in range(57)]))
>> >>>>     df.to_feather(DATA_FILE)
>> >>>>     print(f'wrote {DATA_FILE}')
>> >>>>     import sys
>> >>>>     sys.exit()
>> >>>>
>> >>>> if __name__ == "__main__":
>> >>>>     # setup()
>> >>>>     process = psutil.Process(os.getpid())
>> >>>>     path = DATA_FILE
>> >>>>
>> >>>>     mem_size = process.memory_info().rss / 1e9
>> >>>>     print(f'BEFORE mem_size: {mem_size}gb')
>> >>>>
>> >>>>     feather_table = feather.read_table(path)
>> >>>>     # df = feather_table.to_pandas(split_blocks=True)
>> >>>>     # df = feather_table.to_pandas()
>> >>>>     df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
>> >>>>
>> >>>>     mem_size = process.memory_info().rss / 1e9
>> >>>>     df_size = df.memory_usage().sum() / 1e9
>> >>>>     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >>>>     print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >>>>
>> >>>> OUTPUT (to_pandas()):
>> >>>> BEFORE mem_size: 0.091795456gb
>> >>>> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
>> >>>> ARROW: 3.281625024gb
>> >>>>
>> >>>> OUTPUT (to_pandas(split_blocks=True)):
>> >>>> BEFORE mem_size: 0.091795456gb
>> >>>> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
>> >>>> ARROW: 3.281627712gb
>> >>>>
>> >>>> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
>> >>>> BEFORE mem_size: 0.091795456gb
>> >>>> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
>> >>>> ARROW: 3.281627712gb
>> >>>>
>> >>>> I'm guessing since this feature is experimental, it might either go away, or might have strange behaviors. Is there anything I should look out for, or is there some alternative to reproduce these results?
>> >>>>
>> >>>> Thank You,
>> >>>> Arun
>> >>>>
>> >>>> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche <[email protected]> wrote:
>> >>>>>
>> >>>>> Hi Arun, Weston,
>> >>>>>
>> >>>>> I didn't try running the script locally, but a quick note: the `feather.read_feather` function reads the Feather file into an Arrow table and directly converts it to a pandas DataFrame. A memory consumption 2x the size of the dataframe does not sound that unexpected to me: most of the time, when converting an arrow table to a pandas DataFrame, the data will be copied to accommodate pandas' specific internal memory layout (at least numeric columns will be combined together in 2D arrays).
>> >>>>>
>> >>>>> To verify if this is the cause, you might want to do either of:
>> >>>>> - use `feather.read_table` instead of `feather.read_feather`, which will read the file as an Arrow table instead (and doesn't do any conversion to pandas)
>> >>>>> - if you want to include the conversion to pandas, also use `read_table` and do the conversion to pandas explicitly with a `to_pandas()` call on the result.
>> >>>>> In that case, you can specify `split_blocks=True` to use more zero-copy conversion in the arrow->pandas conversion.
>> >>>>>
>> >>>>> Joris
>> >>>>>
>> >>>>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <[email protected]> wrote:
>> >>>>> >
>> >>>>> > Hi Wes,
>> >>>>> >
>> >>>>> > Sorry for the late reply on this, but I think I got a reproducible test case:
>> >>>>> >
>> >>>>> > import pandas as pd
>> >>>>> > import numpy as np
>> >>>>> > import pyarrow as pa
>> >>>>> > from pyarrow import feather
>> >>>>> > import os
>> >>>>> > import psutil
>> >>>>> > pa.set_memory_pool(pa.system_memory_pool())
>> >>>>> > DATA_FILE = 'test.arrow'
>> >>>>> >
>> >>>>> > def setup():
>> >>>>> >     np.random.seed(0)
>> >>>>> >     df = pd.DataFrame(np.random.uniform(0,100,size=(7196546, 57)), columns=list([f'i_{i}' for i in range(57)]))
>> >>>>> >     df.to_feather(DATA_FILE)
>> >>>>> >     print(f'wrote {DATA_FILE}')
>> >>>>> >     import sys
>> >>>>> >     sys.exit()
>> >>>>> >
>> >>>>> > if __name__ == "__main__":
>> >>>>> >     # setup()
>> >>>>> >     process = psutil.Process(os.getpid())
>> >>>>> >     path = DATA_FILE
>> >>>>> >
>> >>>>> >     mem_size = process.memory_info().rss / 1e9
>> >>>>> >     print(f'BEFORE mem_size: {mem_size}gb')
>> >>>>> >
>> >>>>> >     df = feather.read_feather(path)
>> >>>>> >
>> >>>>> >     mem_size = process.memory_info().rss / 1e9
>> >>>>> >     df_size = df.memory_usage().sum() / 1e9
>> >>>>> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >>>>> >     print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >>>>> >
>> >>>>> > OUTPUT:
>> >>>>> > BEFORE mem_size: 0.091795456gb
>> >>>>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
>> >>>>> > ARROW: 3.281625024gb
>> >>>>> >
>> >>>>> > Let me know if you're able to see similar results.
>> >>>>> >
>> >>>>> > Thanks,
>> >>>>> > Arun
>> >>>>> >
>> >>>>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <[email protected]> wrote:
>> >>>>> >>
>> >>>>> >> I get more or less the same results as you for the provided setup data (exact same #'s for arrow & df_size and slightly different for RSS which is to be expected). The fact that the arrow size is much lower than the dataframe size is not too surprising to me. If a column can't be zero copied then its memory will disappear from the arrow pool (I think). Plus, object columns will have overhead in pandas that they do not have in Arrow.
>> >>>>> >>
>> >>>>> >> The df_size issue for me seems to be tied to string columns. I think pandas is overestimating how much size is needed there (many of my strings are similar and I wonder if some kind of object sharing is happening). But we can table this for another time.
>> >>>>> >>
>> >>>>> >> I tried writing my feather file with your parameters and it didn't have much impact on any of the numbers.
>> >>>>> >>
>> >>>>> >> Since the arrow size for you is expected (nearly the same as the df_size) I'm not sure what to investigate next. The memory does not seem to be retained by Arrow. Is there any chance you could create a reproducible test case using randomly generated numpy data (then you could share that setup function)?
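As a side note on the df_size discussion above, here is a small illustration (not from the quoted messages) of how memory_usage(deep=True) can overstate string columns: it sums sys.getsizeof() per element, so a string object shared by many rows is counted once per row. Numbers are approximate and assume CPython.

import sys
import pandas as pd

# Every row references the *same* 100-character Python string object.
s = 'a' * 100
df = pd.DataFrame({'col': [s] * 1_000_000})

print(df.memory_usage(deep=True).sum())  # on the order of 150 MB reported
print(sys.getsizeof(s))                  # the one shared string is ~150 bytes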
>> >>>>> >>
>> >>>>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <[email protected]> wrote:
>> >>>>> >> >
>> >>>>> >> > Hi Wes,
>> >>>>> >> >
>> >>>>> >> > I'm not including the setup() call when I encounter the issue. I just kept it in there for ease of reproducibility. Memory usage is indeed higher when it is included, but that isn't surprising.
>> >>>>> >> >
>> >>>>> >> > I tried switching over to the system allocator but there is no change.
>> >>>>> >> >
>> >>>>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
>> >>>>> >> >
>> >>>>> >> > I updated my script to also include the Arrow bytes allocated and it gave me the following:
>> >>>>> >> >
>> >>>>> >> > MVE:
>> >>>>> >> > import pandas as pd
>> >>>>> >> > import pyarrow as pa
>> >>>>> >> > from pyarrow import feather
>> >>>>> >> > import os
>> >>>>> >> > import psutil
>> >>>>> >> > pa.set_memory_pool(pa.system_memory_pool())
>> >>>>> >> >
>> >>>>> >> > def setup():
>> >>>>> >> >     df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >>>>> >> >     df.to_feather('test.csv')
>> >>>>> >> >
>> >>>>> >> > if __name__ == "__main__":
>> >>>>> >> >     # setup()
>> >>>>> >> >     process = psutil.Process(os.getpid())
>> >>>>> >> >     path = 'test.csv'
>> >>>>> >> >
>> >>>>> >> >     mem_size = process.memory_info().rss / 1e9
>> >>>>> >> >     print(f'BEFORE mem_size: {mem_size}gb')
>> >>>>> >> >
>> >>>>> >> >     df = feather.read_feather(path)
>> >>>>> >> >
>> >>>>> >> >     df_size = df.memory_usage(deep=True).sum() / 1e9
>> >>>>> >> >     mem_size = process.memory_info().rss / 1e10
>> >>>>> >> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >>>>> >> >     print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >>>>> >> >
>> >>>>> >> > Output with my data:
>> >>>>> >> > BEFORE mem_size: 0.08761344gb
>> >>>>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>> >>>>> >> > ARROW: 3.080121792gb
>> >>>>> >> >
>> >>>>> >> > Output with Provided Setup Data:
>> >>>>> >> > BEFORE mem_size: 0.09179136gb
>> >>>>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>> >>>>> >> > ARROW: 0.00029664gb
>> >>>>> >> >
>> >>>>> >> > I'm assuming that the df and the arrow bytes allocated/sizes are distinct and non-overlapping, but it seems strange that the output with the provided data has the Arrow bytes allocated at ~0GB whereas the one with my data has the allocated data approximately equal to the dataframe size. I'm not sure if it affects anything but my file was written with the following:
>> >>>>> >> >
>> >>>>> >> > import pyarrow.lib as ext
>> >>>>> >> > import pyarrow
>> >>>>> >> > COMPRESSION_LEVEL = 19
>> >>>>> >> > COMPRESSION_ALGO = 'zstd'
>> >>>>> >> > KILOBYTE = 1 << 10
>> >>>>> >> > MEGABYTE = KILOBYTE * KILOBYTE
>> >>>>> >> > CHUNK_SIZE = MEGABYTE
>> >>>>> >> >
>> >>>>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
>> >>>>> >> > ext.write_feather(table, dest, compression=compression, compression_level=compression_level, chunksize=chunk_size, version=2)
>> >>>>> >> >
>> >>>>> >> > As to the discrepancy around calculating dataframe size. I'm not sure why that would be so off for you.
>> >>>>> >> > Going off the docs, it seems like it should be accurate. My Dataframe in question is [7196546 rows x 56 columns] where each column is mostly a float or integer and a datetime index. 7196546 * 56 * 8 = 3224052608 ~= 3.2GB which roughly aligns.
>> >>>>> >> >
>> >>>>> >> > Thank You,
>> >>>>> >> > Arun
>> >>>>> >> >
>> >>>>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <[email protected]> wrote:
>> >>>>> >> >>
>> >>>>> >> >> 2x overshoot of memory does seem a little high. Are you including the "setup" part when you encounter that? Arrow's file-based CSV reader will require 2-3x memory usage because it buffers the bytes in memory in case it needs to re-convert them later (because it realizes the data type for the column is different). I'm not sure if Pandas' CSV reader is similar.
>> >>>>> >> >>
>> >>>>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold on to a bit more memory and hold onto it (for a little while at least) even after it is no longer used. Even malloc will hold onto memory sometimes due to fragmentation or other concerns. You could try changing to the system allocator (pa.set_memory_pool(pa.system_memory_pool()) at the top of your file) to see if that makes a difference.
>> >>>>> >> >>
>> >>>>> >> >> I'm not sure your method of calculating the dataframe size is reliable. I don't actually know enough about pandas, but when I tried your experiment with my own 1.9GB CSV file it ended up reporting:
>> >>>>> >> >>
>> >>>>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>> >>>>> >> >>
>> >>>>> >> >> which seems suspicious.
>> >>>>> >> >>
>> >>>>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't seem all that unexpected. There was 2.348GB of usage. Arrow itself was only using ~1.9GB and I will naively assume the difference between the two is bloat caused by object wrappers when converting to pandas.
>> >>>>> >> >>
>> >>>>> >> >> Another thing you might try and measure is `pa.default_memory_pool().bytes_allocated()`. This will tell you how much memory Arrow itself is hanging onto. If that is not 6GB then it is a pretty good guess that memory is being held somewhere else.
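A short sketch of that kind of check (the file name 'test.arrow' and the overall flow are placeholders, not from the quoted messages); release_unused() is the jemalloc purge mentioned elsewhere in the thread and may not shrink RSS immediately.

import pyarrow as pa
from pyarrow import feather

table = feather.read_table('test.arrow')  # placeholder path

# How much memory Arrow's default pool is currently holding.
print(f'pool:  {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
print(f'total: {pa.total_allocated_bytes() / 1e9}gb')

df = table.to_pandas()
del table

# Ask the pool to hand unused memory back to the OS; RSS may still take a
# moment to reflect it.
pa.default_memory_pool().release_unused()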
>> >>>>> >> >>
>> >>>>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <[email protected]> wrote:
>> >>>>> >> >> >
>> >>>>> >> >> > Hi Apache Arrow Members,
>> >>>>> >> >> >
>> >>>>> >> >> > My question is below but I've compiled a minimum reproducible example with a public dataset:
>> >>>>> >> >> >
>> >>>>> >> >> > import pandas as pd
>> >>>>> >> >> > from pyarrow import feather
>> >>>>> >> >> > import os
>> >>>>> >> >> > import psutil
>> >>>>> >> >> >
>> >>>>> >> >> > def setup():
>> >>>>> >> >> >     df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >>>>> >> >> >     df.to_feather('test.csv')
>> >>>>> >> >> >
>> >>>>> >> >> > if __name__ == "__main__":
>> >>>>> >> >> >     # setup()
>> >>>>> >> >> >     process = psutil.Process(os.getpid())
>> >>>>> >> >> >     path = 'test.csv'
>> >>>>> >> >> >
>> >>>>> >> >> >     mem_size = process.memory_info().rss / 1e9
>> >>>>> >> >> >     print(f'BEFORE mem_size: {mem_size}gb')
>> >>>>> >> >> >
>> >>>>> >> >> >     df = feather.read_feather(path)
>> >>>>> >> >> >
>> >>>>> >> >> >     df_size = df.memory_usage(deep=True).sum() / 1e9
>> >>>>> >> >> >     mem_size = process.memory_info().rss / 1e9
>> >>>>> >> >> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >>>>> >> >> >
>> >>>>> >> >> > I substituted my df with a sample csv. I had trouble finding a sample CSV of adequate size; however, my dataset is ~3GB, and I see memory usage of close to 6GB.
>> >>>>> >> >> >
>> >>>>> >> >> > Output with My Data:
>> >>>>> >> >> > BEFORE mem_size: 0.088891392gb
>> >>>>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>> >>>>> >> >> >
>> >>>>> >> >> > It seems strange that the overall memory usage of the process is approx double the size of the dataframe itself. Is there a reason for this, and is there a way to mitigate this?
>> >>>>> >> >> >
>> >>>>> >> >> > $ conda list pyarrow
>> >>>>> >> >> > #
>> >>>>> >> >> > # Name       Version     Build                  Channel
>> >>>>> >> >> > pyarrow      4.0.1       py37h0f64622_13_cpu    conda-forge
>> >>>>> >> >> >
>> >>>>> >> >> > Thank You,
>> >>>>> >> >> > Arun Joseph
>> >>>>> >> >
>> >>>>> >> > --
>> >>>>> >> > Arun Joseph
>> >>>>> >
>> >>>>> > --
>> >>>>> > Arun Joseph
>> >>>>
>> >>>> --
>> >>>> Arun Joseph
>> >>>
>> >>> --
>> >>> Arun Joseph
>> >>
>> >> --
>> >> Arun Joseph
>> >
>> > --
>> > Arun Joseph
>>
>
> --
> Arun Joseph
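Putting the thread's suggestions together, a sketch of the low-peak-memory conversion path discussed above; 'test.arrow' is a placeholder, and self_destruct is experimental, so behavior may change between pyarrow versions.

from pyarrow import feather

table = feather.read_table('test.arrow')  # placeholder path

# split_blocks:      one allocation per column instead of consolidated 2D blocks
# self_destruct:     free each Arrow column as soon as it has been converted
#                    (the Table must not be used afterwards)
# use_threads=False: convert one column at a time so the peak overhead is
#                    roughly one extra column's worth of memory
df = table.to_pandas(split_blocks=True, self_destruct=True, use_threads=False)
del table  # the self-destructed Table is unusable anyway

print(df.shape)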
