Additionally, I tested with my actual data, and did not see memory savings.
On Mon, Dec 6, 2021 at 10:35 AM Arun Joseph <[email protected]> wrote:

> Hi Joris,
>
> Thank you for the explanation. The 2x memory consumption on conversion
> makes sense if there is a copy, but it does seem like it persists longer
> than it should. Might that be because of Python's GC policies?
> I tried out your recommendations, but they did not seem to work. However, I
> did notice an experimental option on `to_pandas`, `self_destruct`, which
> seems to address the issue I'm facing. Sadly, that by itself did not work
> either... but, combined with `split_blocks=True`, I am seeing memory
> savings:
>
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> from pyarrow import feather
> import os
> import psutil
>
> pa.set_memory_pool(pa.system_memory_pool())
> DATA_FILE = 'test.arrow'
>
> def setup():
>     np.random.seed(0)
>     df = pd.DataFrame(np.random.randint(0, 100, size=(7196546, 57)),
>                       columns=[f'{i}' for i in range(57)])
>     df.to_feather(DATA_FILE)
>     print(f'wrote {DATA_FILE}')
>     import sys
>     sys.exit()
>
> if __name__ == "__main__":
>     # setup()
>     process = psutil.Process(os.getpid())
>     path = DATA_FILE
>
>     mem_size = process.memory_info().rss / 1e9
>     print(f'BEFORE mem_size: {mem_size}gb')
>
>     feather_table = feather.read_table(path)
>     # df = feather_table.to_pandas(split_blocks=True)
>     # df = feather_table.to_pandas()
>     df = feather_table.to_pandas(self_destruct=True, split_blocks=True)
>
>     mem_size = process.memory_info().rss / 1e9
>     df_size = df.memory_usage().sum() / 1e9
>     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>     print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>
> OUTPUT (to_pandas()):
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 6.737887232gb df_size: 3.281625104gb
> ARROW: 3.281625024gb
>
> OUTPUT (to_pandas(split_blocks=True)):
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 6.752907264gb df_size: 3.281625104gb
> ARROW: 3.281627712gb
>
> OUTPUT (to_pandas(self_destruct=True, split_blocks=True)):
> BEFORE mem_size: 0.091795456gb
> AFTER mem_size: 4.039512064gb df_size: 3.281625104gb
> ARROW: 3.281627712gb
>
> I'm guessing that since this feature is experimental, it might either go
> away or might have strange behaviors. Is there anything I should look out
> for, or is there some alternative to reproduce these results?
>
> Thank You,
> Arun
>
> On Mon, Dec 6, 2021 at 10:07 AM Joris Van den Bossche
> <[email protected]> wrote:
>
>> Hi Arun, Weston,
>>
>> I didn't try running the script locally, but a quick note: the
>> `feather.read_feather` function reads the Feather file into an Arrow
>> table and directly converts it to a pandas DataFrame. A memory
>> consumption of 2x the size of the dataframe does not sound that
>> unexpected to me: most of the time, when converting an Arrow table to a
>> pandas DataFrame, the data will be copied to accommodate pandas'
>> specific internal memory layout (at least numeric columns will be
>> combined together in 2D arrays).
>>
>> To verify whether this is the cause, you might want to do either of:
>> - use `feather.read_table` instead of `feather.read_feather`, which
>>   will read the file as an Arrow table instead (and not do any
>>   conversion to pandas)
>> - if you want to include the conversion to pandas, also use
>>   `read_table` and do the conversion to pandas explicitly with a
>>   `to_pandas()` call on the result. In that case, you can specify
>>   `split_blocks=True` to use more zero-copy conversion in the
>>   arrow->pandas conversion.
>>
>> Joris
>>
>> On Mon, 6 Dec 2021 at 15:05, Arun Joseph <[email protected]> wrote:
>> >
>> > Hi Wes,
>> >
>> > Sorry for the late reply on this, but I think I got a reproducible
>> > test case:
>> >
>> > import pandas as pd
>> > import numpy as np
>> > import pyarrow as pa
>> > from pyarrow import feather
>> > import os
>> > import psutil
>> >
>> > pa.set_memory_pool(pa.system_memory_pool())
>> > DATA_FILE = 'test.arrow'
>> >
>> > def setup():
>> >     np.random.seed(0)
>> >     df = pd.DataFrame(np.random.uniform(0, 100, size=(7196546, 57)),
>> >                       columns=[f'i_{i}' for i in range(57)])
>> >     df.to_feather(DATA_FILE)
>> >     print(f'wrote {DATA_FILE}')
>> >     import sys
>> >     sys.exit()
>> >
>> > if __name__ == "__main__":
>> >     # setup()
>> >     process = psutil.Process(os.getpid())
>> >     path = DATA_FILE
>> >
>> >     mem_size = process.memory_info().rss / 1e9
>> >     print(f'BEFORE mem_size: {mem_size}gb')
>> >
>> >     df = feather.read_feather(path)
>> >
>> >     mem_size = process.memory_info().rss / 1e9
>> >     df_size = df.memory_usage().sum() / 1e9
>> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >     print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >
>> > OUTPUT:
>> > BEFORE mem_size: 0.091795456gb
>> > AFTER mem_size: 6.762156032gb df_size: 3.281625104gb
>> > ARROW: 3.281625024gb
>> >
>> > Let me know if you're able to see similar results.
>> >
>> > Thanks,
>> > Arun
>> >
>> > On Fri, Dec 3, 2021 at 6:03 PM Weston Pace <[email protected]>
>> > wrote:
>> >>
>> >> I get more or less the same results as you for the provided setup
>> >> data (exact same numbers for arrow & df_size, and slightly different
>> >> for RSS, which is to be expected). The fact that the arrow size is
>> >> much lower than the dataframe size is not too surprising to me. If a
>> >> column can't be zero-copied, then its memory will disappear from the
>> >> arrow pool (I think). Plus, object columns will have overhead in
>> >> pandas that they do not have in Arrow.
>> >>
>> >> The df_size issue for me seems to be tied to string columns. I think
>> >> pandas is overestimating how much size is needed there (many of my
>> >> strings are similar, and I wonder if some kind of object sharing is
>> >> happening). But we can table this for another time.
>> >>
>> >> I tried writing my feather file with your parameters and it didn't
>> >> have much impact on any of the numbers.
>> >>
>> >> Since the arrow size for you is expected (nearly the same as the
>> >> df_size), I'm not sure what to investigate next. The memory does not
>> >> seem to be retained by Arrow. Is there any chance you could create a
>> >> reproducible test case using randomly generated numpy data (then you
>> >> could share that setup function)?
>> >>
>> >> On Fri, Dec 3, 2021 at 12:13 PM Arun Joseph <[email protected]> wrote:
>> >> >
>> >> > Hi Wes,
>> >> >
>> >> > I'm not including the setup() call when I encounter the issue. I
>> >> > just kept it in there for ease of reproducibility. Memory usage is
>> >> > indeed higher when it is included, but that isn't surprising.
>> >> >
>> >> > I tried switching over to the system allocator, but there is no
>> >> > change.
>> >> >
>> >> > I've updated to Arrow 6.0.1 as well and there is no change.
>> >> >
>> >> > I updated my script to also include the Arrow bytes allocated, and
>> >> > it gave me the following:
>> >> >
>> >> > MVE:
>> >> > import pandas as pd
>> >> > import pyarrow as pa
>> >> > from pyarrow import feather
>> >> > import os
>> >> > import psutil
>> >> >
>> >> > pa.set_memory_pool(pa.system_memory_pool())
>> >> >
>> >> > def setup():
>> >> >     df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >> >     df.to_feather('test.csv')
>> >> >
>> >> > if __name__ == "__main__":
>> >> >     # setup()
>> >> >     process = psutil.Process(os.getpid())
>> >> >     path = 'test.csv'
>> >> >
>> >> >     mem_size = process.memory_info().rss / 1e9
>> >> >     print(f'BEFORE mem_size: {mem_size}gb')
>> >> >
>> >> >     df = feather.read_feather(path)
>> >> >
>> >> >     df_size = df.memory_usage(deep=True).sum() / 1e9
>> >> >     mem_size = process.memory_info().rss / 1e10
>> >> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >> >     print(f'ARROW: {pa.default_memory_pool().bytes_allocated() / 1e9}gb')
>> >> >
>> >> > Output with my data:
>> >> > BEFORE mem_size: 0.08761344gb
>> >> > AFTER mem_size: 6.297198592gb df_size: 3.080121688gb
>> >> > ARROW: 3.080121792gb
>> >> >
>> >> > Output with Provided Setup Data:
>> >> > BEFORE mem_size: 0.09179136gb
>> >> > AFTER mem_size: 0.011487232gb df_size: 0.024564664gb
>> >> > ARROW: 0.00029664gb
>> >> >
>> >> > I'm assuming that the df and the arrow bytes allocated/sizes are
>> >> > distinct and non-overlapping, but it seems strange that the output
>> >> > with the provided data has the Arrow bytes allocated at ~0GB,
>> >> > whereas the one with my data has the allocated data approximately
>> >> > equal to the dataframe size.
>> >> >
>> >> > I'm not sure if it affects anything, but my file was written with
>> >> > the following:
>> >> >
>> >> > import pyarrow.lib as ext
>> >> > import pyarrow
>> >> > COMPRESSION_LEVEL = 19
>> >> > COMPRESSION_ALGO = 'zstd'
>> >> > KILOBYTE = 1 << 10
>> >> > MEGABYTE = KILOBYTE * KILOBYTE
>> >> > CHUNK_SIZE = MEGABYTE
>> >> >
>> >> > table = pyarrow.Table.from_pandas(df, preserve_index=preserve_index)
>> >> > ext.write_feather(table, dest, compression=compression,
>> >> >                   compression_level=compression_level,
>> >> >                   chunksize=chunk_size, version=2)
>> >> >
>> >> > As to the discrepancy around calculating dataframe size: I'm not
>> >> > sure why that would be so off for you. Going off the docs, it seems
>> >> > like it should be accurate. My DataFrame in question is
>> >> > [7196546 rows x 56 columns], where each column is mostly a float or
>> >> > integer, plus a datetime index. 7196546 * 56 * 8 = 3224052608
>> >> > ~= 3.2GB, which roughly aligns.
>> >> >
>> >> > Thank You,
>> >> > Arun
>> >> >
>> >> > On Fri, Dec 3, 2021 at 4:36 PM Weston Pace <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> A 2x overshoot of memory does seem a little high. Are you
>> >> >> including the "setup" part when you encounter that? Arrow's
>> >> >> file-based CSV reader will require 2-3x memory usage because it
>> >> >> buffers the bytes in memory in case it needs to re-convert them
>> >> >> later (because it realizes the data type for the column is
>> >> >> different). I'm not sure if pandas' CSV reader is similar.
>> >> >>
>> >> >> Dynamic memory allocators (e.g. jemalloc) can cause Arrow to hold
>> >> >> on to a bit more memory and hold onto it (for a little while at
>> >> >> least) even after it is no longer used. Even malloc will hold onto
>> >> >> memory sometimes due to fragmentation or other concerns. You could
>> >> >> try changing to the system allocator
>> >> >> (pa.set_memory_pool(pa.system_memory_pool()) at the top of your
>> >> >> file) to see if that makes a difference.
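[Editor's note: Arun's back-of-the-envelope estimate for the [7196546 rows x 56 columns] frame above can be checked directly. This is a sketch of the same arithmetic, with the added assumption (consistent with the reported `df_size`) that the datetime index contributes one more 8-byte column:]

```python
# Rough size estimate for a frame of 7196546 rows x 56 eight-byte
# (float64/int64) columns plus a datetime64[ns] index (8 bytes/row).
rows, cols, bytes_per_value = 7196546, 56, 8

data_bytes = rows * cols * bytes_per_value   # 3224052608, Arun's ~3.2GB figure
index_bytes = rows * 8                       # assumed datetime64[ns] index
total_gb = (data_bytes + index_bytes) / 1e9

print(f'estimated df_size: {total_gb:.9f}gb')
```

Including the index, the estimate comes to 3.281624976gb, which is within a few hundred bytes of the `df_size: 3.281625104gb` reported throughout the thread (`DataFrame.memory_usage` adds a small amount of bookkeeping overhead).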
>> >> >>
>> >> >> I'm not sure your method of calculating the dataframe size is
>> >> >> reliable. I don't actually know enough about pandas, but when I
>> >> >> tried your experiment with my own 1.9GB CSV file it ended up
>> >> >> reporting:
>> >> >>
>> >> >> AFTER mem_size: 2.348068864gb df_size: 4.519898461gb
>> >> >>
>> >> >> which seems suspicious.
>> >> >>
>> >> >> Anyways, my tests with my own CSV file (on Arrow 6.0.1) didn't
>> >> >> seem all that unexpected. There was 2.348GB of usage. Arrow itself
>> >> >> was only using ~1.9GB, and I will naively assume the difference
>> >> >> between the two is bloat caused by object wrappers when converting
>> >> >> to pandas.
>> >> >>
>> >> >> Another thing you might try and measure is
>> >> >> `pa.default_memory_pool().bytes_allocated()`. This will tell you
>> >> >> how much memory Arrow itself is hanging onto. If that is not 6GB,
>> >> >> then it is a pretty good guess that memory is being held somewhere
>> >> >> else.
>> >> >>
>> >> >> On Fri, Dec 3, 2021 at 10:54 AM Arun Joseph <[email protected]>
>> >> >> wrote:
>> >> >> >
>> >> >> > Hi Apache Arrow Members,
>> >> >> >
>> >> >> > My question is below, but I've compiled a minimum reproducible
>> >> >> > example with a public dataset:
>> >> >> >
>> >> >> > import pandas as pd
>> >> >> > from pyarrow import feather
>> >> >> > import os
>> >> >> > import psutil
>> >> >> >
>> >> >> > def setup():
>> >> >> >     df = pd.read_csv('https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-csv.csv')
>> >> >> >     df.to_feather('test.csv')
>> >> >> >
>> >> >> > if __name__ == "__main__":
>> >> >> >     # setup()
>> >> >> >     process = psutil.Process(os.getpid())
>> >> >> >     path = 'test.csv'
>> >> >> >
>> >> >> >     mem_size = process.memory_info().rss / 1e9
>> >> >> >     print(f'BEFORE mem_size: {mem_size}gb')
>> >> >> >
>> >> >> >     df = feather.read_feather(path)
>> >> >> >
>> >> >> >     df_size = df.memory_usage(deep=True).sum() / 1e9
>> >> >> >     mem_size = process.memory_info().rss / 1e9
>> >> >> >     print(f'AFTER mem_size: {mem_size}gb df_size: {df_size}gb')
>> >> >> >
>> >> >> > I substituted my df with a sample CSV. I had trouble finding a
>> >> >> > sample CSV of adequate size; however, my dataset is ~3GB, and I
>> >> >> > see memory usage of close to 6GB.
>> >> >> >
>> >> >> > Output with My Data:
>> >> >> > BEFORE mem_size: 0.088891392gb
>> >> >> > AFTER mem_size: 6.324678656gb df_size: 3.080121688gb
>> >> >> >
>> >> >> > It seems strange that the overall memory usage of the process is
>> >> >> > approximately double the size of the dataframe itself. Is there
>> >> >> > a reason for this, and is there a way to mitigate it?
>> >> >> >
>> >> >> > $ conda list pyarrow
>> >> >> > #
>> >> >> > # Name    Version    Build                  Channel
>> >> >> > pyarrow   4.0.1      py37h0f64622_13_cpu    conda-forge
>> >> >> >
>> >> >> > Thank You,
>> >> >> > Arun Joseph

--
Arun Joseph
