[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599489#comment-17599489 ]

Yin commented on ARROW-17590:
-----------------------------

Yep,
pa.total_allocated_bytes 289.74639892578125 MB dt.nbytes 0.0011539459228515625 MB
sleep 5 seconds
pa.total_allocated_bytes 0.0184326171875 MB dt.nbytes 0.0011539459228515625 MB
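
For reference, the numbers above came from a check along these lines (a sketch;
the file path and the filter column/value are placeholders):

    import time
    import pyarrow as pa
    import pyarrow.parquet as pq

    MB = 1024 * 1024

    # Filtered read that matches only a handful of rows.
    dt = pq.read_table("sample.parquet", filters=[("key", "=", 1)])
    print("pa.total_allocated_bytes", pa.total_allocated_bytes() / MB, "MB",
          "dt.nbytes", dt.nbytes / MB, "MB")

    print("sleep 5 seconds")
    time.sleep(5)  # give the allocator time to hand the scan buffers back

    print("pa.total_allocated_bytes", pa.total_allocated_bytes() / MB, "MB",
          "dt.nbytes", dt.nbytes / MB, "MB")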

Thanks, Weston. Let me close this Jira.

> Lower memory usage with filters
> -------------------------------
>
>                 Key: ARROW-17590
>                 URL: https://issues.apache.org/jira/browse/ARROW-17590
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Yin
>            Priority: Major
>         Attachments: sample-1.py, sample.py
>
>
> Hi,
> When I read a Parquet file (about 23MB, with 250K rows and 600 object/string 
> columns containing lots of None values) with a filter on a non-null column that 
> matches a small number of rows (e.g. 1 to 500), the memory usage is quite high 
> (around 900MB to 1GB), even though the resulting table and dataframe hold only 
> a few rows (about 20KB for 1 row, about 20MB for 500 rows). It looks like many 
> rows are scanned/loaded from the Parquet file. Not only is the memory footprint 
> (high-water mark) large, but the memory also does not seem to be released 
> promptly (e.g. after a GC in Python), although it may be reused by subsequent 
> reads.
> When reading the same Parquet file with all columns and no filter, the memory 
> usage is about the same, around 900MB. It grows to 2.3GB after converting to a 
> pandas dataframe with to_pandas; df.info(memory_usage='deep') reports 4.3GB, 
> which may be double-counting something.
> Limiting the number of columns read helps. Reading 1 column, with a filter 
> matching 1 or more rows or with no filter at all, takes about 10MB, which is 
> much smaller, but still larger than the size of the resulting table or 
> dataframe with 1 or 500 rows of that single column (under 1MB).
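> Roughly, the configurations being compared are (a sketch; the file path, 
> column names, and filter values are placeholders):
>
>     import pyarrow.parquet as pq
>
>     path = "data.parquet"  # stands in for the ~23MB, 250K-row, 600-column file
>
>     # All ~600 columns with a filter on a non-null column matching a few rows:
>     # the result is tiny, but peak memory is around 900MB to 1GB.
>     small = pq.read_table(path, filters=[("id", "in", [1, 2, 3])])
>
>     # All columns, no filter, then convert to pandas: ~900MB for the table,
>     # ~2.3GB after to_pandas.
>     full = pq.read_table(path)
>     df = full.to_pandas()
>     df.info(memory_usage="deep")
>
>     # A single column, with or without a filter: about 10MB, still well above
>     # the under-1MB the resulting table actually holds.
>     one = pq.read_table(path, columns=["id"], filters=[("id", "=", 1)])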
> The filtered column is not a partition key; filtering on it does return the 
> correct rows, but the memory usage is quite high even though the Parquet file 
> is not really large, whether partitioned or not. There are earlier reports that 
> look similar to this issue, for example: 
> [https://github.com/apache/arrow/issues/7338]
> Related classes/methods (in pyarrow 9.0.0): 
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I also played with pyarrow._dataset.Scanner.to_table:
>     self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
> Constructing the scanner uses little memory, but usage goes up once the 
> to_table call materializes the scan.
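> The same behavior shows up when going through the dataset API directly (a 
> sketch; the path and filter expression are placeholders):
>
>     import pyarrow as pa
>     import pyarrow.dataset as ds
>
>     dataset = ds.dataset("data.parquet", format="parquet")
>     scanner = dataset.scanner(filter=ds.field("id") == 1)
>     # Building the scanner allocates almost nothing...
>     print(pa.total_allocated_bytes())
>     table = scanner.to_table()
>     # ...the growth happens here, when to_table() materializes the scan.
>     print(pa.total_allocated_bytes())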
> Is there a way or workaround to reduce the memory usage when reading with a 
> filter? 
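> For example, would streaming the scan as record batches instead of 
> materializing the whole table at once keep the peak lower? A rough sketch of 
> what I mean (the column name and batch size are placeholders):
>
>     import pyarrow as pa
>     import pyarrow.dataset as ds
>
>     dataset = ds.dataset("data.parquet", format="parquet")
>     scanner = dataset.scanner(filter=ds.field("id") == 1, batch_size=10_000)
>
>     matched = []
>     for batch in scanner.to_batches():
>         # Batches arrive incrementally; only the (small) filtered results are
>         # kept here instead of the whole materialized table at once.
>         matched.append(batch)
>
>     table = pa.Table.from_batches(matched, schema=scanner.projected_schema)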
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us when we need to load all or many columns. 
> I am not sure what improvement is possible given how the Parquet columnar 
> format works, or whether this can be patched in the PyArrow Python code or 
> requires changes to (and a rebuild of) the Arrow C++ code.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
