[ https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin updated ARROW-17590:
------------------------
    Description: 
Hi,
When I read a Parquet file (about 23 MB, with 250K rows and 600 object/string 
columns containing lots of None values) with a filter on a non-null column that 
matches a small number of rows (e.g. 1 to 500), the memory usage is quite high 
(around 900 MB to 1 GB). The resulting table and dataframe contain only a few 
rows (1 row is about 20 KB, 500 rows about 20 MB), so it looks like many rows 
are scanned/loaded from the Parquet file. Not only is the memory footprint 
(high-water mark) high, the memory also does not seem to be released in time 
(such as after a GC in Python), though it may be reused for a subsequent read.
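For reference, a minimal sketch of the read pattern described above; the file 
name, column name, and filter value below are made-up placeholders, not the 
actual data:

    import pyarrow.parquet as pq

    # ~23 MB file, ~250K rows, ~600 mostly-None string columns (per the
    # description); "data.parquet", "key_col" and "some_value" are placeholders.
    table = pq.read_table(
        "data.parquet",
        filters=[("key_col", "==", "some_value")],  # matches only 1 to 500 rows
    )
    print(table.num_rows, table.nbytes)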

When reading the same Parquet file with all columns and no filter, the memory 
usage is about the same, around 900 MB. It goes up to 2.3 GB after converting 
to a pandas dataframe with to_pandas; df.info(memory_usage='deep') reports 
4.3 GB, which may be double-counting something.
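Roughly, the unfiltered read and pandas conversion that produced those numbers 
(file name again a placeholder):

    import pyarrow.parquet as pq

    table = pq.read_table("data.parquet")     # process memory around 900 MB
    df = table.to_pandas()                    # grows to about 2.3 GB
    df.info(memory_usage='deep')              # reports about 4.3 GB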

Limiting the number of columns read helps. Reading a single column, with a 
filter matching one or more rows or with no filter at all, takes about 10 MB, 
which is much smaller and better, but still larger than the resulting table or 
dataframe of 1 or 500 rows of that one column (under 1 MB).
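For example, restricting the read to one (hypothetical) column:

    import pyarrow.parquet as pq

    # Peaks around 10 MB, while the resulting one-column table/dataframe
    # itself stays under 1 MB; column name and value are placeholders.
    one_col = pq.read_table(
        "data.parquet",
        columns=["key_col"],
        filters=[("key_col", "==", "some_value")],
    )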

The filtered column is not a partition key; filtering on it works functionally 
and returns the correct rows, but the memory usage is quite high even though 
the Parquet file is not really large, whether partitioned or not. There are 
earlier reports of a similar issue, for example: 
[https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression,
                           use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I also played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns,
                          filter=self._filter_expression).to_table()
Memory usage is small while constructing the scanner, but goes up once the 
to_table call materializes the result.
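The same path expressed with the public dataset API, roughly (path, column 
name and value are placeholders):

    import pyarrow.dataset as ds

    dataset = ds.dataset("data.parquet", format="parquet")
    scanner = dataset.scanner(
        columns=None,                               # None reads all columns
        filter=ds.field("key_col") == "some_value",
    )
    table = scanner.to_table()   # memory climbs here, when the scan materializes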

Is there a way or workaround to reduce the memory usage when reading with a 
filter? If it is not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us when we need to load all or many columns. 
I am not sure what improvement is possible given how the Parquet columnar 
format works, and whether it can be patched in the PyArrow Python code or 
requires changing and rebuilding the Arrow C++ code.
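One possible workaround sketch, not verified to actually lower the peak in 
this case, is to stream the filtered scan in record batches instead of 
materializing the whole table at once (the handle() function is a hypothetical 
per-batch consumer):

    import pyarrow as pa
    import pyarrow.dataset as ds

    dataset = ds.dataset("data.parquet", format="parquet")

    # Consume the filtered scan batch by batch, dropping each batch after it
    # is handled; pa.total_allocated_bytes() shows what the Arrow allocator
    # is holding at each step.
    for batch in dataset.to_batches(filter=ds.field("key_col") == "some_value"):
        handle(batch)                       # hypothetical per-batch handler
        print(pa.total_allocated_bytes())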

Thanks!



> Lower memory usage with filters
> -------------------------------
>
>                 Key: ARROW-17590
>                 URL: https://issues.apache.org/jira/browse/ARROW-17590
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Yin
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
