[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599104#comment-17599104 ]
Yin edited comment on ARROW-17590 at 9/1/22 7:07 PM:
-----------------------------------------------------
Hi Weston,

Just saw your comment. Will try it in the sample code. Thanks.

Update: Printed out pyarrow.total_allocated_bytes and table.nbytes; the numbers below are from the updated sample code. Case A reads all columns with the filter: total_allocated_bytes is 289 MB while dt.nbytes is very small. Case B reads one column with the filter. Case C reads all columns without the filter.

# total_allocated_bytes and table.nbytes
# pyarrow 7.0.0, pandas 1.4.4, numpy 1.23.2
# A: 289 MB / 0.00115 MB    B: 3.5 MB / 9.53e-06 MB    C: 289.72 MB / 288.38 MB
# pyarrow 9.0.0, pandas 1.4.4, numpy 1.23.2
# A: 289 MB / 0.0014 MB     B: 3.5 MB / 1.049e-05 MB   C: 289.72 MB / 288.38 MB

# rss memory after read_table
# pyarrow 7.0.0, pandas 1.4.4, numpy 1.23.2
# A: 1008 MB    B: 88 MB    C: 1008 MB
# pyarrow 9.0.0, pandas 1.4.4, numpy 1.23.2
# A: 394 MB     B: 85 MB    C: 393 MB

was (Author: JIRAUSER285415):
Hi Weston,

Just saw your comment. Will try it in the sample code. Thanks.
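For reference, a minimal sketch of how numbers like these can be collected. The actual script is in the attached sample-1.py, so the file path ("data.parquet"), the filter column name ("key"), the filter value, and the use of psutil for RSS are placeholders/assumptions here, not the attachment's exact code.

```python
import os

import psutil                      # assumption: psutil used to probe RSS; any RSS probe works
import pyarrow as pa
import pyarrow.parquet as pq

MB = 1024 * 1024


def read_case(label, columns=None, filters=None):
    # Read the parquet file with optional column projection and row filter,
    # then report Arrow allocations, table size, and process RSS.
    table = pq.read_table("data.parquet", columns=columns, filters=filters)
    rss = psutil.Process(os.getpid()).memory_info().rss / MB
    print(
        f"{label}: total_allocated_bytes={pa.total_allocated_bytes() / MB:.2f} MB, "
        f"table.nbytes={table.nbytes / MB:.6f} MB, rss={rss:.0f} MB"
    )
    return table


# Note: for RSS and allocation numbers comparable to those above,
# each case is best run in a fresh process rather than back to back.

# A: all columns with the filter (placeholder column name and value)
read_case("A", filters=[("key", "=", "some_value")])
# B: one column with the filter
read_case("B", columns=["key"], filters=[("key", "=", "some_value")])
# C: all columns, no filter
read_case("C")
```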
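The issue description quoted below asks about workarounds to reduce memory during filtered reads. One approach that may be worth trying (a sketch under assumptions, not an official fix; the dataset path, column name, value, and batch size are placeholders) is to scan the dataset batch by batch so that only one batch of decoded data needs to be resident at a time, keeping only the rows that pass the filter. Whether this actually lowers the peak depends on readahead and the allocator's behavior.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Open the parquet file as a dataset and build the filter expression.
dataset = ds.dataset("data.parquet", format="parquet")
flt = ds.field("key") == "some_value"          # placeholder filter column/value

# Scan with a modest batch size and materialize only the filtered batches.
scanner = dataset.scanner(filter=flt, batch_size=64 * 1024)
batches = [b for b in scanner.to_batches() if b.num_rows > 0]
table = pa.Table.from_batches(batches, schema=scanner.projected_schema)

# Optionally ask the allocator to return freed memory to the OS sooner;
# this may help with the "memory not released in time" observation.
pa.default_memory_pool().release_unused()
```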
> Lower memory usage with filters
> -------------------------------
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Yin
> Priority: Major
> Attachments: sample-1.py, sample.py
>
>
> Hi,
> When I read a parquet file (about 23 MB, with 250K rows and 600 object/string columns containing lots of None values) with a filter on a not-null column that matches a small number of rows (e.g. 1 to 500), the memory usage is quite high (around 900 MB to 1 GB), even though the resulting table and dataframe hold only a few rows (1 row is about 20 KB, 500 rows about 20 MB). It looks like the read scans/loads many more rows from the parquet file than the filter selects. Not only is the memory footprint/watermark high, the memory also does not seem to be released in time (e.g. after GC in Python), although it may get reused for subsequent reads.
> When reading the same parquet file with all columns and no filter, the memory usage is about the same, 900 MB. It goes up to 2.3 GB after to_pandas; df.info(memory_usage='deep') reports 4.3 GB, which may be double-counting something.
> It helps to limit the number of columns read. Reading 1 column, with a filter matching 1 or more rows or with no filter, takes about 10 MB, which is much smaller and better, but still bigger than the size of a table or dataframe with 1 or 500 rows of 1 column (under 1 MB).
> The filtered column is not a partition key; the filter works correctly and returns the right rows, but the memory usage is high even though the parquet file is not really large, partitioned or not. There are earlier reports similar to this issue, for example:
> [https://github.com/apache/arrow/issues/7338]
> Related classes/methods (pyarrow 9.0.0):
> _ParquetDatasetV2.read calls
> self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)
> which goes through pyarrow._dataset.FileSystemDataset.to_table.
> I also played with pyarrow._dataset.Scanner.to_table:
> self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
> The memory usage is small while constructing the scanner, but goes up once the to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with read filtering?
> If not supported yet, can it be fixed/improved with priority?
> This is a blocking issue for us when we need to load all or many columns.
> I am not sure what improvement is possible given how the parquet columnar format works, and whether it can be patched in the pyarrow Python code or needs changes to the Arrow C++ code.
> Thanks!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)