[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin updated ARROW-17590:
------------------------
    Description: 
Hi,
When I read a large Parquet file with a filter that matches only a small number 
of rows, the memory usage is quite high. The resulting table and dataframe 
contain only a few rows, so it looks like many more rows are scanned/loaded 
from the Parquet file than are returned. Not only is the memory footprint 
(high-water mark) large, but the memory also does not seem to be released 
promptly (for example after a GC in Python), although it may be reused for 
subsequent reads.

The filtered column is not a partition key; functionally the filter works and 
returns a small number of rows, but the memory usage is high whenever the 
Parquet data (partitioned or not) is large. There are earlier reports related 
to this issue, for example: [https://github.com/apache/arrow/issues/7338]
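
For reference, the read pattern looks roughly like this (a minimal sketch; the 
file path, column names, and filter value are placeholders, not our actual data):

    import pyarrow.dataset as ds

    # Hypothetical example: a large Parquet file, filtered on a non-partition column.
    dataset = ds.dataset("/data/large_file.parquet", format="parquet")
    table = dataset.to_table(
        columns=["id", "value"],            # placeholder column names
        filter=ds.field("id") == 12345,     # matches only a few rows
    )
    df = table.to_pandas()                  # small result, but peak memory is high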

Related classes/methods (in pyarrow 9.0.0):

_ParquetDatasetV2.read:
    self._dataset.to_table(columns=columns, filter=self._filter_expression,
                           use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I also played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns,
                          filter=self._filter_expression).to_table()
The memory usage is small while constructing the scanner, but goes up once the 
to_table call materializes the result.
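
One way I can observe this (a sketch, assuming the default memory pool; the 
path and column names are placeholders):

    import pyarrow as pa
    import pyarrow.dataset as ds

    dataset = ds.dataset("/data/large_file.parquet", format="parquet")  # placeholder path
    scanner = dataset.scanner(columns=["id", "value"],
                              filter=ds.field("id") == 12345)
    print("after scanner:", pa.total_allocated_bytes())    # small
    table = scanner.to_table()
    print("after to_table:", pa.total_allocated_bytes())   # much larger than the result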

Is there some way or workaround to reduce the memory usage with filters? 
If this is not supported yet, can it be fixed or improved with priority? 
This is a blocking issue for us. I don't know what is involved with respect to 
the Parquet columnar format, whether it can be patched somehow in the pyarrow 
Python code, or whether it would require changing and rebuilding the Arrow C++ 
code.

Thanks!


> Lower memory usage with filters
> -------------------------------
>
>                 Key: ARROW-17590
>                 URL: https://issues.apache.org/jira/browse/ARROW-17590
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Yin
>            Priority: Major
>


