[ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599104#comment-17599104
 ] 

Yin edited comment on ARROW-17590 at 9/1/22 7:07 PM:
-----------------------------------------------------

Hi Weston, I just saw your comment. I will try it in the sample code. Thanks.

Update:

Printed out pyarrow.total_allocated_bytes and table.nbytes; the numbers below
are from the updated sample code, and a minimal measurement sketch follows them.

In case A, reading all columns with the filter, total_allocated_bytes is 289 MB
while dt.nbytes is very small.

Case B reads one column with the filter.
Case C reads all columns without a filter.

# total_allocated_bytes and table.nbytes
# pyarrow 7.0.0, pandas 1.4.4, numpy 1.23.2
# A: 289 MB / 0.00115 MB   B: 3.5 MB / 9.53e-06 MB    C: 289.72 MB / 288.38 MB
# pyarrow 9.0.0, pandas 1.4.4, numpy 1.23.2
# A: 289 MB / 0.0014 MB    B: 3.5 MB / 1.049e-05 MB   C: 289.72 MB / 288.38 MB

# rss memory after read_table
# pyarrow 7.0.0, pandas 1.4.4 numpy 1.23.2
# A: 1008 MB B: 88 MB C: 1008 MB
# pyarrow 9.0.0, pandas 1.4.4, numpy 1.23.2
# A: 394 MB B: 85 MB C: 393 MB
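
For reference, a minimal sketch of how these numbers can be collected (the
attached sample.py is the authoritative script; the file path, filter column,
and filter value below are placeholders):

    import os
    import psutil                      # only for the rss numbers
    import pyarrow as pa
    import pyarrow.parquet as pq

    path = "data.parquet"              # placeholder path
    flt = [("some_col", "=", 1)]       # placeholder filter on a non-null column

    # Case A: all columns with the filter.  Case B additionally passes
    # columns=["some_col"]; case C drops the filters argument.  Each case is
    # run in a fresh process so the numbers do not accumulate.
    dt = pq.read_table(path, filters=flt)

    print("allocated:", pa.total_allocated_bytes() / 2**20, "MB")
    print("table:", dt.nbytes / 2**20, "MB")
    print("rss after read_table:",
          psutil.Process(os.getpid()).memory_info().rss / 2**20, "MB")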


> Lower memory usage with filters
> -------------------------------
>
>                 Key: ARROW-17590
>                 URL: https://issues.apache.org/jira/browse/ARROW-17590
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Yin
>            Priority: Major
>         Attachments: sample-1.py, sample.py
>
>
> Hi,
> When I read a Parquet file (about 23 MB, with 250K rows and 600 object/string
> columns containing lots of None values) with a filter on a non-null column that
> matches a small number of rows (e.g. 1 to 500), the memory usage is pretty high
> (around 900 MB to 1 GB). The resulting table and dataframe have only a few rows
> (1 row is about 20 KB, 500 rows about 20 MB). It looks like many rows are
> scanned/loaded from the Parquet file. Not only is the memory footprint
> (high-water mark) high, but the memory also does not seem to be released in time
> (e.g. after GC in Python), though it may get reused for a subsequent read.
> When reading the same Parquet file with all columns and no filter, the memory
> usage is about the same, at 900 MB. It goes up to 2.3 GB after converting to a
> pandas dataframe with to_pandas; df.info(memory_usage='deep') shows 4.3 GB,
> which may be double counting something.
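> (For reference, the dataframe numbers come from roughly the following; the
> table variable name is a placeholder for the result of the read above.)
>     df = dt.to_pandas()
>     df.info(memory_usage='deep')   # reports deep pandas memory usage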
> It helps to limit the number of columns read. Reading one column, with a filter
> matching one or more rows or with no filter at all, takes about 10 MB, which is
> much smaller and better, but still bigger than the size of the resulting table
> or dataframe with 1 or 500 rows of one column (under 1 MB).
> The filtered column is not a partition key; filtering works functionally and
> returns the correct rows, but the memory usage is quite high even though the
> Parquet file is not really large, whether it is partitioned or not. There are
> some reports of similar issues, for example:
> [https://github.com/apache/arrow/issues/7338]
> Related classes/methods (in pyarrow 9.0.0):
> _ParquetDatasetV2.read calls:
>     self._dataset.to_table(columns=columns, filter=self._filter_expression,
>                            use_threads=use_threads)
> which goes through pyarrow._dataset.FileSystemDataset.to_table.
> I also played with pyarrow._dataset.Scanner.to_table:
>     self._dataset.scanner(columns=columns,
>                           filter=self._filter_expression).to_table()
> The memory usage is small when the scanner is constructed, but goes up once the
> to_table call materializes it.
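> Roughly, the two read paths look like this (the dataset path and filter column
> below are placeholders):
>     import pyarrow.dataset as ds
>     dataset = ds.dataset("data.parquet", format="parquet")
>     filt = ds.field("some_col") == 1
>     # path taken by _ParquetDatasetV2.read / pq.read_table(..., filters=...)
>     table = dataset.to_table(filter=filt)
>     # explicit scanner: cheap to construct; memory grows once to_table runs
>     scanner = dataset.scanner(filter=filt)
>     table = scanner.to_table()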
> Is there some way or workaround to reduce the memory usage when reading with
> filters?
> If it is not supported yet, can it be fixed/improved with priority?
> This is a blocking issue for us when we need to load all or many columns.
> I am not sure what improvement is possible given how the Parquet columnar format
> works, and whether it can be patched in the PyArrow Python code or needs changes
> to the Arrow C++ code.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
