[jira] [Closed] (ARROW-17590) Lower memory usage with filters

2022-09-02 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin closed ARROW-17590.
---
Resolution: Duplicate

> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
> Attachments: sample-1.py, sample.py
>
>
> Hi,
> When I read a parquet file (about 23MB, 250K rows, 600 object/string columns with lots of None) with a filter on a non-null column that matches a small number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). The resulting table and DataFrame have only a few rows (1 row ~20KB, 500 rows ~20MB). It looks like many rows are scanned/loaded from the parquet file. Not only is the footprint (high-water mark) of memory usage high, but the memory also does not seem to be released promptly (e.g. after a Python GC), though it may get reused by a subsequent read.
> When reading the same parquet file for all columns without a filter, the memory usage is about the same at 900MB. It goes up to 2.3GB after converting to a pandas DataFrame with to_pandas; df.info(memory_usage='deep') shows 4.3GB, which may be double counting something.
> It helps to limit the number of columns read. Reading 1 column, with a filter matching 1 or more rows or without a filter, takes about 10MB, which is much smaller and better, but still larger than the size of the table or DataFrame with 1 or 500 rows of 1 column (under 1MB).
> The filtered column is not a partition key; the filter works functionally and returns the correct rows, but the memory usage is quite high even though the parquet file is not really large, partitioned or not. There are earlier reports of similar issues, for example:
> [https://github.com/apache/arrow/issues/7338]
> Related classes/methods (pyarrow 9.0.0):
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table:
>     self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
> The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.
> Is there some way or workaround to reduce the memory usage when reading with a filter?
> If not supported yet, can it be fixed/improved with priority?
> This is a blocking issue for us when we need to load all or many columns.
> I am not sure what improvement is possible given how the parquet columnar format works, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.
> Thanks!
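
For reference, a minimal sketch of the kind of read described above, assuming a placeholder file path and filter column/value ("key", "row-42"); pandas.read_parquet, pyarrow.parquet.read_table and pyarrow.dataset are the standard entry points rather than the reporter's exact script:

```python
import pandas as pd
import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = "data.parquet"  # placeholder: ~250K rows, ~600 mostly-None string columns in the report

# pandas.read_parquet forwards the filter to pyarrow.parquet.read_table
df = pd.read_parquet(path, engine="pyarrow", filters=[("key", "==", "row-42")])

# Equivalent pyarrow calls: read_table with filters, or a dataset filter expression.
table = pq.read_table(path, filters=[("key", "==", "row-42")])
table2 = ds.dataset(path, format="parquet").to_table(filter=ds.field("key") == "row-42")

# The filter only skips whole row groups via statistics; row groups that may contain
# matches are still decoded before the final in-memory filter, which is consistent
# with the high peak memory reported above.
print(len(df), table.num_rows, table2.num_rows)
```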



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17590) Lower memory usage with filters

2022-09-02 Thread Yin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599489#comment-17599489
 ] 

Yin commented on ARROW-17590:
-

Yep:
pa.total_allocated_bytes 289.74639892578125 MB, dt.nbytes 0.0011539459228515625 MB
sleep 5 seconds
pa.total_allocated_bytes 0.0184326171875 MB, dt.nbytes 0.0011539459228515625 MB

Thanks, Weston. Let me close this Jira.
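
The numbers above read like output from a check along these lines; a minimal sketch with a placeholder file and filter, using pa.total_allocated_bytes() and Table.nbytes, with the 5-second sleep mirroring the delayed release discussed here:

```python
import time
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder file/filter names.
dt = pq.read_table("sample.parquet", filters=[("key", "==", "row-42")])
print("pa.total_allocated_bytes", pa.total_allocated_bytes() / 2**20, "MB",
      "dt.nbytes", dt.nbytes / 2**20, "MB")

time.sleep(5)  # the allocator returns freed scan buffers after a short delay
print("pa.total_allocated_bytes", pa.total_allocated_bytes() / 2**20, "MB",
      "dt.nbytes", dt.nbytes / 2**20, "MB")

# Optionally ask the pool to hand unused memory back right away.
pa.default_memory_pool().release_unused()
```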




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-09-01 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Attachment: sample-1.py




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-17590) Lower memory usage with filters

2022-09-01 Thread Yin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599104#comment-17599104
 ] 

Yin edited comment on ARROW-17590 at 9/1/22 7:07 PM:
-

Hi Weston, just saw your comment. I will try it in the sample code. Thanks.

Update:

I printed out pyarrow.total_allocated_bytes and table.nbytes; the numbers below are from the updated sample code.

In case A (reading all columns with the filter), total_allocated_bytes is 289 MB and dt.nbytes is very small.

Case B reads one column with the filter.
Case C reads all columns without a filter.

# total_allocated_bytes and table.nbytes
# pyarrow 7.0.0, pandas 1.4.4, numpy 1.23.2
# A: 289 MB / 0.00115 MB  B: 3.5 MB / 9.53e-06 MB  C: 289.72 MB / 288.38 MB
# pyarrow 9.0.0, pandas 1.4.4, numpy 1.23.2
# A: 289 MB / 0.0014 MB  B: 3.5 MB / 1.049e-05 MB  C: 289.72 MB / 288.38 MB

# rss memory after read_table
# pyarrow 7.0.0, pandas 1.4.4, numpy 1.23.2
# A: 1008 MB  B: 88 MB  C: 1008 MB
# pyarrow 9.0.0, pandas 1.4.4, numpy 1.23.2
# A: 394 MB  B: 85 MB  C: 393 MB
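
A sketch of how the A/B/C cases above could be instrumented; psutil and the file, column, and filter names are assumptions, while pa.total_allocated_bytes() and Table.nbytes are the quantities quoted in the comment:

```python
import gc
import psutil
import pyarrow as pa
import pyarrow.parquet as pq

proc = psutil.Process()

def report(label, table):
    gc.collect()
    print(f"{label}: rss {proc.memory_info().rss / 2**20:.0f} MB, "
          f"arrow {pa.total_allocated_bytes() / 2**20:.2f} MB, "
          f"table.nbytes {table.nbytes / 2**20:.5f} MB")

path = "sample.parquet"          # placeholder path
flt = [("key", "==", "row-42")]  # placeholder filter column/value

report("A (all columns, filter)", pq.read_table(path, filters=flt))
report("B (one column, filter)", pq.read_table(path, columns=["col_0"], filters=flt))
report("C (all columns, no filter)", pq.read_table(path))
```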


was (Author: JIRAUSER285415):
Hi Weston, just saw your comment. I will try it in the sample code. Thanks.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17590) Lower memory usage with filters

2022-09-01 Thread Yin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599104#comment-17599104
 ] 

Yin commented on ARROW-17590:
-

Hi Weston, just saw your comment. I will try it in the sample code. Thanks.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17590) Lower memory usage with filters

2022-09-01 Thread Yin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599102#comment-17599102
 ] 

Yin commented on ARROW-17590:
-

Hi Will,
I am using pandas.read_parquet, which goes to pyarrow.parquet.read_table.
I tried pre_buffer=False and saw no difference.
I will try iter_batches with filtering.

I attached the sample code [^sample.py], with memory stats from a Windows machine.
I used PyArrow 9.0.0 (and 7.0.0) with the latest pandas 1.4.4 and numpy 1.23.2.
PyArrow 9.0.0 improved read_table and saved some memory, but to_pandas is still about the same.
The main question here is why read_table uses about the same amount of memory when loading all columns with a filter as it does without one.

By the way, I wonder whether memory allocation for repeated strings (e.g. empty) and None could also be more efficient in to_pandas, without requiring Categorical columns explicitly.

Thanks,
-Yin
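
A hedged sketch of the iter_batches route mentioned above, filtering batch by batch instead of materializing the whole table at once, plus the to_pandas(strings_to_categorical=True) option related to the repeated-strings point; the path, column name, and filter value are placeholders:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

pf = pq.ParquetFile("sample.parquet")  # placeholder path

matches = []
for batch in pf.iter_batches(batch_size=64_000):
    tbl = pa.Table.from_batches([batch])
    mask = pc.equal(tbl["key"], "row-42")  # placeholder filter column/value
    filtered = tbl.filter(mask)
    if filtered.num_rows:
        matches.append(filtered)

result = (pa.concat_tables(matches) if matches
          else pa.Table.from_batches([], schema=pf.schema_arrow))

# strings_to_categorical converts string columns to pandas Categorical, which can
# shrink DataFrames dominated by repeated or empty strings.
df = result.to_pandas(strings_to_categorical=True)
```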




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-09-01 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Attachment: sample.py




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-09-01 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23MB, 250K rows, 600 object/string columns with lots of None) with a filter on a non-null column that matches a small number of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). The resulting table and DataFrame have only a few rows (1 row ~20KB, 500 rows ~20MB). It looks like many rows are scanned/loaded from the parquet file. Not only is the footprint (high-water mark) of memory usage high, but the memory also does not seem to be released promptly (e.g. after a Python GC), though it may get reused by a subsequent read.

When reading the same parquet file for all columns without a filter, the memory usage is about the same at 900MB. It goes up to 2.3GB after converting to a pandas DataFrame with to_pandas; df.info(memory_usage='deep') shows 4.3GB, which may be double counting something.

It helps to limit the number of columns read. Reading 1 column, with a filter matching 1 or more rows or without a filter, takes about 10MB, which is much smaller and better, but still larger than the size of the table or DataFrame with 1 or 500 rows of 1 column (under 1MB).

The filtered column is not a partition key; the filter works functionally and returns the correct rows, but the memory usage is quite high even though the parquet file is not really large, partitioned or not. There are earlier reports of similar issues, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage when reading with a filter?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us when we need to load all or many columns.
I am not sure what improvement is possible given how the parquet columnar format works, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!
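
As a possible workaround for the scanner path described above, the scan can be consumed in bounded batches rather than materialized with a single to_table call; a minimal sketch under assumed file and column names:

```python
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("data.parquet", format="parquet")  # placeholder path
scanner = dataset.scanner(
    filter=ds.field("key") == "row-42",  # placeholder filter column/value
    batch_size=64_000,                   # smaller batches keep peak memory lower
)

# Stream the filtered batches instead of calling scanner.to_table() in one shot.
batches = [b for b in scanner.to_batches() if b.num_rows]
table = pa.Table.from_batches(batches, schema=scanner.projected_schema)
```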

  was:
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

When reading the same parquet file for all columns without a filter, the memory usage is about the same at 900MB. It goes up to 2.3GB after converting to a pandas DataFrame with to_pandas; df.info(memory_usage='deep') shows 4.3GB, which may be double counting something.

It helps to limit the number of columns read. Reading one column, with a filter matching 1 or more rows or without a filter, takes about 100MB.



The filtered column is not a partition key, which functionally works to get the 
correct rows. But the memory usage is quite high even when the parquet file is 
not really large, partitioned or not. There were some references similar to 
this issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage when reading with a filter?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us when we need to load all or many columns.
I am not sure what improvement is possible given how the parquet columnar format works, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!



[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-09-01 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

When reading the same parquet file for all columns without a filter, the memory usage is about the same at 900MB. It goes up to 2.3GB after converting to a pandas DataFrame with to_pandas; df.info(memory_usage='deep') shows 4.3GB, which may be double counting something.

It helps to limit the number of columns read. Reading one column, with a filter matching 1 or more rows or without a filter, takes about 100MB.



The filtered column is not a partition key, which functionally works to get the 
correct rows. But the memory usage is quite high even when the parquet file is 
not really large, partitioned or not. There were some references similar to 
this issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage when reading with a filter?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us when we need to load all or many columns.
I am not sure what improvement is possible given how the parquet columnar format works, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!

  was:
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

When reading the same parquet file without filtering, the memory usage is about the same at 900MB. It goes up to 2.3GB after converting to a pandas DataFrame with to_pandas; df.info(memory_usage='deep') shows 4.3GB, which may be double counting something.

The filtered column is not a partition key, which functionally works to get the 
correct rows. But the memory usage is quite high even when the parquet file is 
not really large, partitioned or not. There were some references similar to 
this issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage when reading with a filter?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!



[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23MB with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

When reading the same parquet file without filtering, the memory usage is about the same at 900MB. It goes up to 2.3GB after converting to a pandas DataFrame with to_pandas; df.info(memory_usage='deep') shows 4.3GB, which may be double counting something.

The filtered column is not a partition key, which functionally works to get the 
correct rows. But the memory usage is quite high even when the parquet file is 
not really large, partitioned or not. There were some references similar to 
this issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage when reading with a filter?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!

  was:
Hi,
When I read a parquet file (about 23mb with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

When reading the same parquet file without filtering, the memory usage is about the same at 900MB, and goes up to 2.3GB after converting to a pandas DataFrame with to_pandas; df.info(memory_usage='deep') shows 4.3GB, which may be double counting something.

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!



[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23mb with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

When reading the same parquet file without filtering, the memory usage is about the same at 900MB, and goes up to 2.3GB after converting to a pandas DataFrame with to_pandas; df.info(memory_usage='deep') shows 4.3GB, which may be double counting something.

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!

  was:
Hi,
When I read a parquet file (about 23mb with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!



[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23mb with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (around 900MB to 1GB). 
The result table and dataframe have only a few rows (1 row 20kb, 500 rows 
20MB). Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!

  was:
Hi,
When I read a parquet file (about 23mb with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (close to 1GB). The 
result table and dataframe have only a few rows (1 row 20kb, 500 rows 20MB). 
Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!



[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23mb with 250K rows and 600 object/string 
columns with lots of None) with filter on a not null column for a small number 
of rows (e.g. 1 to 500), the memory usage is pretty high (close to 1GB). The 
result table and dataframe have only a few rows (1 row 20kb, 500 rows 20MB). 
Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!

  was:
Hi,
When I read a parquet file (about 23mb with 250K rows and 600 object/string 
columns with lots of None) with filter for a small number of rows (e.g. 1 to 
500), the memory usage is pretty high (close to 1GB). The result table and 
dataframe have only a few rows (at about 20MB). Looks like it scans/loads many 
rows from the parquet file. Not only the footprint or watermark of memory usage 
is high, but also it seems not releasing the memory in time (such as after GC 
in Python, but may get used for subsequent read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!



[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a parquet file (about 23mb with 250K rows and 600 object/string 
columns with lots of None) with filter for a small number of rows (e.g. 1 to 
500), the memory usage is pretty high (close to 1GB). The result table and 
dataframe have only a few rows (at about 20MB). Looks like it scans/loads many 
rows from the parquet file. Not only the footprint or watermark of memory usage 
is high, but also it seems not releasing the memory in time (such as after GC 
in Python, but may get used for subsequent read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!

  was:
Hi,
When I read a large parquet file with filter for a small number of rows, the 
memory usage is pretty high. The result table and dataframe have only a few 
rows. Looks like it scans/loads many rows from the parquet file. Not only the 
footprint or watermark of memory usage is high, but also it seems not releasing 
the memory in time (such as after GC in Python, but may get used for subsequent 
read).

The filtered column is not a partition key, which functionally works to get a 
small number of rows. But the memory usage is high when the parquet 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, filter=self._filter_expression).to_table()
The memory usage is small when constructing the scanner but goes up once the to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters?
If not supported yet, can it be fixed/improved with priority?
This is a blocking issue for us. I don't know what may be involved with respect to the parquet columnar format, and whether it can be patched in the PyArrow Python code or needs changes to the Arrow C++ code.

Thanks!



[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a large parquet file with a filter that matches only a small number 
of rows, the memory usage is pretty high. The result table and dataframe have 
only a few rows, so it looks like many rows are scanned/loaded from the parquet 
file. Not only is the memory footprint or watermark high, but the memory also 
does not seem to be released promptly (for example after GC in Python, though 
it may get reused for a subsequent read).

The filtered column is not a partition key; the filter functionally returns the 
small number of rows, but the memory usage is high when the parquet file 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (in pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
Memory usage is small when the scanner is constructed but goes up once the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I don't know what may be involved with respect 
to the parquet columnar format, or whether it can be patched in the PyArrow 
Python code or requires changing and rebuilding the Arrow C++ code.

Thanks!
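
As a side note on the "memory not released" observation above, one way to check 
whether the Arrow memory pool is still holding the allocation is to compare 
pa.total_allocated_bytes() before and after the read and then ask the default 
pool to hand unused memory back to the OS. This is a minimal sketch only; the 
file path and filter are hypothetical, and release_unused() is assumed to be 
available in this pyarrow version:

import gc
import pyarrow as pa
import pyarrow.parquet as pq

before = pa.total_allocated_bytes()
table = pq.read_table("data.parquet", filters=[("key", "=", "some_value")])
print("allocated after read:", pa.total_allocated_bytes() - before, "bytes")

del table
gc.collect()                                  # drop the Python references
pa.default_memory_pool().release_unused()     # return unused pool memory to the OS
print("allocated after release:", pa.total_allocated_bytes() - before, "bytes")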

  was:
Hi,
When I read a large parquet file with a filter that matches only a small number 
of rows, the memory usage is pretty high. The result table and dataframe have 
only a few rows, so it looks like many rows are scanned/loaded from the parquet 
file. Not only is the memory footprint or watermark high, but the memory also 
does not seem to be released promptly (for example after GC in Python, though 
it may get reused for a subsequent read).

The filtered column is not a partition key; the filter functionally returns the 
small number of rows, but the memory usage is high when the parquet file 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (in pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
Memory usage is small when the scanner is constructed but goes up once the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I am not sure whether it can be patched in the 
Python code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a large parquet file with a filter that matches only a small 
> number of rows, the memory usage is pretty high. The result table and 
> dataframe have only a few rows, so it looks like many rows are scanned/loaded 
> from the parquet file. Not only is the memory footprint or watermark high, 
> but the memory also does not seem to be released promptly (for example after 
> GC in Python, though it may get reused for a subsequent read).
> The filtered column is not a partition key; the filter functionally returns 
> the small number of rows, but the memory usage is high when the parquet file 
> (partitioned or not) is large. There were some references related to this 
> issue, for example: [https://github.com/apache/arrow/issues/7338]
> Related classes/methods (in pyarrow 9.0.0):
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table:
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> Memory usage is small when the scanner is constructed but goes up once the 
> to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with filters? 
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us. I don't know what may be involved with 
> respect to the parquet columnar format, or whether it can be patched in the 
> PyArrow Python code or requires changing and rebuilding the Arrow C++ code.

[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a large parquet file with a filter that matches only a small number 
of rows, the memory usage is pretty high. The result table and dataframe have 
only a few rows, so it looks like many rows are scanned/loaded from the parquet 
file. Not only is the memory footprint or watermark high, but the memory also 
does not seem to be released promptly (for example after GC in Python, though 
it may get reused for a subsequent read).

The filtered column is not a partition key; the filter functionally returns the 
small number of rows, but the memory usage is high when the parquet file 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (in pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
Memory usage is small when the scanner is constructed but goes up once the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I am not sure whether it can be patched in the 
Python code.

Thanks!

  was:
Hi,
When I read a large parquet file with a filter that matches only a small number 
of rows, the memory usage is pretty high. The result table and dataframe have 
only a few rows, so it looks like many rows are scanned/loaded from the parquet 
file. Not only is the memory footprint or watermark high, but the memory also 
does not seem to be released promptly (for example after GC in Python, though 
it may get reused for a subsequent read).

The filtered column is not a partition key; the filter functionally returns the 
small number of rows, but the memory usage is high when the parquet file 
(partitioned or not) is large. There were some references related to this 
issue, for example: 
[https://github.com/apache/arrow/issues/7338]

Related classes/methods (in pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
Memory usage is small when the scanner is constructed but goes up once the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I am not sure whether it can be patched in the 
Python code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a large parquet file with a filter that matches only a small 
> number of rows, the memory usage is pretty high. The result table and 
> dataframe have only a few rows, so it looks like many rows are scanned/loaded 
> from the parquet file. Not only is the memory footprint or watermark high, 
> but the memory also does not seem to be released promptly (for example after 
> GC in Python, though it may get reused for a subsequent read).
> The filtered column is not a partition key; the filter functionally returns 
> the small number of rows, but the memory usage is high when the parquet file 
> (partitioned or not) is large. There were some references related to this 
> issue, for example: [https://github.com/apache/arrow/issues/7338]
> Related classes/methods (in pyarrow 9.0.0):
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table:
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> Memory usage is small when the scanner is constructed but goes up once the 
> to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with filters? 
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us. I am not sure whether it can be patched in 
> the Python code.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-17590:

Description: 
Hi,
When I read a large parquet file with a filter that matches only a small number 
of rows, the memory usage is pretty high. The result table and dataframe have 
only a few rows, so it looks like many rows are scanned/loaded from the parquet 
file. Not only is the memory footprint or watermark high, but the memory also 
does not seem to be released promptly (for example after GC in Python, though 
it may get reused for a subsequent read).

The filtered column is not a partition key; the filter functionally returns the 
small number of rows, but the memory usage is high when the parquet file 
(partitioned or not) is large. There were some references related to this 
issue, for example: 
[https://github.com/apache/arrow/issues/7338]

Related classes/methods (in pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
Memory usage is small when the scanner is constructed but goes up once the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I am not sure whether it can be patched in the 
Python code.

Thanks!

  was:
Hi,
When I read a large parquet file with a filter that matches only a small number 
of rows, the memory usage is pretty high. The result table and dataframe have 
only a few rows, so it looks like many rows are scanned/loaded from the parquet 
file. Not only is the memory footprint or watermark high, but the memory also 
does not seem to be released promptly (for example after GC in Python, though 
it may get reused for a subsequent read).

The filtered column is not a partition key; the filter functionally returns the 
small number of rows, but the memory usage is high when the parquet file 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (in pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
Memory usage is small when the scanner is constructed but goes up once the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I am not sure whether it can be patched in the 
Python code.

Thanks!


> Lower memory usage with filters
> ---
>
> Key: ARROW-17590
> URL: https://issues.apache.org/jira/browse/ARROW-17590
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
>
> Hi,
> When I read a large parquet file with a filter that matches only a small 
> number of rows, the memory usage is pretty high. The result table and 
> dataframe have only a few rows, so it looks like many rows are scanned/loaded 
> from the parquet file. Not only is the memory footprint or watermark high, 
> but the memory also does not seem to be released promptly (for example after 
> GC in Python, though it may get reused for a subsequent read).
> The filtered column is not a partition key; the filter functionally returns 
> the small number of rows, but the memory usage is high when the parquet file 
> (partitioned or not) is large. There were some references related to this 
> issue, for example: 
> [https://github.com/apache/arrow/issues/7338]
> Related classes/methods (in pyarrow 9.0.0):
> _ParquetDatasetV2.read
>     self._dataset.to_table(columns=columns, filter=self._filter_expression, 
> use_threads=use_threads)
> pyarrow._dataset.FileSystemDataset.to_table
> I played with pyarrow._dataset.Scanner.to_table:
>     self._dataset.scanner(columns=columns, 
> filter=self._filter_expression).to_table()
> Memory usage is small when the scanner is constructed but goes up once the 
> to_table call materializes it.
> Is there some way or workaround to reduce the memory usage with filters? 
> If not supported yet, can it be fixed/improved with priority? 
> This is a blocking issue for us. I am not sure whether it can be patched in 
> the Python code.
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17590) Lower memory usage with filters

2022-08-31 Thread Yin (Jira)
Yin created ARROW-17590:
---

 Summary: Lower memory usage with filters
 Key: ARROW-17590
 URL: https://issues.apache.org/jira/browse/ARROW-17590
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Yin


Hi,
When I read a large parquet file with a filter that matches only a small number 
of rows, the memory usage is pretty high. The result table and dataframe have 
only a few rows, so it looks like many rows are scanned/loaded from the parquet 
file. Not only is the memory footprint or watermark high, but the memory also 
does not seem to be released promptly (for example after GC in Python, though 
it may get reused for a subsequent read).

The filtered column is not a partition key; the filter functionally returns the 
small number of rows, but the memory usage is high when the parquet file 
(partitioned or not) is large. There were some references related to this 
issue, for example: [https://github.com/apache/arrow/issues/7338]

Related classes/methods (in pyarrow 9.0.0):

_ParquetDatasetV2.read
    self._dataset.to_table(columns=columns, filter=self._filter_expression, 
use_threads=use_threads)

pyarrow._dataset.FileSystemDataset.to_table

I played with pyarrow._dataset.Scanner.to_table:
    self._dataset.scanner(columns=columns, 
filter=self._filter_expression).to_table()
Memory usage is small when the scanner is constructed but goes up once the 
to_table call materializes it.

Is there some way or workaround to reduce the memory usage with filters? 
If not supported yet, can it be fixed/improved with priority? 
This is a blocking issue for us. I am not sure whether it can be patched in the 
Python code.

Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15724) [C++] Reduce directory and file IO when reading partition parquet dataset with partition key filters

2022-02-17 Thread Yin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494296#comment-17494296
 ] 

Yin commented on ARROW-15724:
-

David, thanks. Good to know. 
It looks like both FileSystemDatasetFactory (for use_legacy_dataset=False) and 
ParquetManifest in the pyarrow code (for use_legacy_dataset=True) could take 
the filtered partition keys into account without discovering all directories 
and files. Or maybe the discovered directories/files could be cached or passed 
in so that subsequent calls are faster.
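
For illustration only, a sketch of the two access patterns being compared, 
assuming a hive-partitioned layout like the one in the attached script; the 
paths and key values here are hypothetical:

import pyarrow.dataset as ds

# Discover the whole tree, then filter on the partition keys; discovery may
# still touch every partition directory even though only one matches.
dataset = ds.dataset("pq_dataset", format="parquet", partitioning="hive")
t_filtered = dataset.to_table(filter=(ds.field("y") == 2022) & (ds.field("m") == 1))

# Point discovery at the single matching partition directory instead; only
# that subtree should be listed and its files opened.
t_direct = ds.dataset("pq_dataset/y=2022/m=1", format="parquet").to_table()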




 

> [C++] Reduce directory and file IO when reading partition parquet dataset 
> with partition key filters
> 
>
> Key: ARROW-15724
> URL: https://issues.apache.org/jira/browse/ARROW-15724
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yin
>Priority: Major
>  Labels: dataset
> Attachments: pq.py
>
>
> Hi,
> It seems that Arrow accesses all partition directories (and even each 
> parquet file), including those that clearly do not match the partition key 
> values in the filter criteria. This can make a multiple-times difference 
> between accessing one partition directly and accessing it with partition key 
> filters, especially on a network file system, and on a local file system 
> when there are lots of partitions, e.g. 1/10th of a second vs. seconds.
> Attached is some Python code to create an example dataframe and save parquet 
> datasets with different hive partition structures (/y=/m=/d=, or /y=/m=, or 
> /dk=), and to read the datasets with/without filters to reproduce the issue. 
> Observe the run time, and the directories and files accessed by the process 
> in Process Monitor on Windows.
> In the three partition structures, I saw in Process Monitor that all 
> directories are accessed regardless of use_legacy_dataset=True or False. 
> When use_legacy_dataset=False, the parquet files in all directories were 
> opened and closed.
> The argument validate_schema=False made a small time difference, but the 
> partition directories are still opened; it is only supported when 
> use_legacy_dataset=True, and it is not supported/passed through by the 
> pandas read_parquet wrapper API.
> The /y=/m= layout is faster because there is no daily partition, so there 
> are fewer directories and files.
> There was another related Stack Overflow question and example 
> [https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
> and there was a comment on the partition discovery:
> {quote}It should get discovered automatically. pd.read_parquet calls 
> pyarrow.parquet.read_table and the default partitioning behavior should be to 
> discover hive-style partitions (i.e. the ones you have). The fact that you 
> have to specify this means that discovery is failing. If you could create a 
> reproducible example and submit it to Arrow JIRA it would be helpful. 
> – Pace  Feb 24 2021 at 18:55
> {quote}
> I wonder if there is a related Jira here already.
> I tried passing in the partitioning argument, but it didn't help. 
> The versions of pyarrow used were 1.0.1, 5, and 7.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15724) [C++] Reduce directory and file IO when reading partition parquet dataset with partition key filters

2022-02-17 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-15724:

Summary: [C++] Reduce directory and file IO when reading partition parquet 
dataset with partition key filters  (was: [C++] Reduce directory and file IO 
when reading partition parquet dataset)

> [C++] Reduce directory and file IO when reading partition parquet dataset 
> with partition key filters
> 
>
> Key: ARROW-15724
> URL: https://issues.apache.org/jira/browse/ARROW-15724
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yin
>Priority: Major
>  Labels: dataset
> Attachments: pq.py
>
>
> Hi,
> It seems that Arrow accesses all partition directories (and even each 
> parquet file), including those that clearly do not match the partition key 
> values in the filter criteria. This can make a multiple-times difference 
> between accessing one partition directly and accessing it with partition key 
> filters, especially on a network file system, and on a local file system 
> when there are lots of partitions, e.g. 1/10th of a second vs. seconds.
> Attached is some Python code to create an example dataframe and save parquet 
> datasets with different hive partition structures (/y=/m=/d=, or /y=/m=, or 
> /dk=), and to read the datasets with/without filters to reproduce the issue. 
> Observe the run time, and the directories and files accessed by the process 
> in Process Monitor on Windows.
> In the three partition structures, I saw in Process Monitor that all 
> directories are accessed regardless of use_legacy_dataset=True or False. 
> When use_legacy_dataset=False, the parquet files in all directories were 
> opened and closed.
> The argument validate_schema=False made a small time difference, but the 
> partition directories are still opened; it is only supported when 
> use_legacy_dataset=True, and it is not supported/passed through by the 
> pandas read_parquet wrapper API.
> The /y=/m= layout is faster because there is no daily partition, so there 
> are fewer directories and files.
> There was another related Stack Overflow question and example 
> [https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
> and there was a comment on the partition discovery:
> {quote}It should get discovered automatically. pd.read_parquet calls 
> pyarrow.parquet.read_table and the default partitioning behavior should be to 
> discover hive-style partitions (i.e. the ones you have). The fact that you 
> have to specify this means that discovery is failing. If you could create a 
> reproducible example and submit it to Arrow JIRA it would be helpful. 
> – Pace  Feb 24 2021 at 18:55
> {quote}
> I wonder if there is a related Jira here already.
> I tried passing in the partitioning argument, but it didn't help. 
> The versions of pyarrow used were 1.0.1, 5, and 7.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-15724) reduce directory and file IO when reading partition parquet dataset

2022-02-17 Thread Yin (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin updated ARROW-15724:

Description: 
Hi,
It seems that Arrow accesses all partition directories (and even each parquet 
file), including those that clearly do not match the partition key values in 
the filter criteria. This can make a multiple-times difference between 
accessing one partition directly and accessing it with partition key filters, 
especially on a network file system, and on a local file system when there are 
lots of partitions, e.g. 1/10th of a second vs. seconds.

Attached is some Python code to create an example dataframe and save parquet 
datasets with different hive partition structures (/y=/m=/d=, or /y=/m=, or 
/dk=), and to read the datasets with/without filters to reproduce the issue. 
Observe the run time, and the directories and files accessed by the process in 
Process Monitor on Windows.

In the three partition structures, I saw in Process Monitor that all 
directories are accessed regardless of use_legacy_dataset=True or False. 
When use_legacy_dataset=False, the parquet files in all directories were opened 
and closed.
The argument validate_schema=False made a small time difference, but the 
partition directories are still opened; it is only supported when 
use_legacy_dataset=True, and it is not supported/passed through by the pandas 
read_parquet wrapper API.

The /y=/m= layout is faster because there is no daily partition, so there are 
fewer directories and files.

There was another related Stack Overflow question and example 
[https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
and there was a comment on the partition discovery:
{quote}It should get discovered automatically. pd.read_parquet calls 
pyarrow.parquet.read_table and the default partitioning behavior should be to 
discover hive-style partitions (i.e. the ones you have). The fact that you have 
to specify this means that discovery is failing. If you could create a 
reproducible example and submit it to Arrow JIRA it would be helpful. 
– Pace  Feb 24 2021 at 18:55
{quote}
I wonder if there is a related Jira here already.
I tried passing in the partitioning argument, but it didn't help. 
The versions of pyarrow used were 1.0.1, 5, and 7.
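
A rough sketch of the kind of script described above (a hypothetical 
reconstruction for illustration, not the attached pq.py; paths and values are 
made up): it writes a small hive-partitioned dataset and reads it back with a 
partition-key filter, which is where the directory and file accesses can be 
watched in Process Monitor:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Small example frame with year/month/day partition columns.
df = pd.DataFrame({
    "y": [2021] * 3 + [2022] * 3,
    "m": [1, 2, 3, 1, 2, 3],
    "d": [1] * 6,
    "value": range(6),
})

# Write a hive-style partitioned dataset: pq_dataset/y=.../m=.../d=.../*.parquet
pq.write_to_dataset(pa.Table.from_pandas(df), root_path="pq_dataset",
                    partition_cols=["y", "m", "d"])

# Read back with a partition-key filter; watch which directories get touched.
t1 = pq.read_table("pq_dataset", filters=[("y", "=", 2022), ("m", "=", 1)])

# For comparison, read the single matching partition directory directly.
t2 = pq.read_table("pq_dataset/y=2022/m=1")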

  was:
Hi,
It seems Arrow accesses all partition directories (and even the parquet files), 
including those that clearly do not match the partition key values in the 
filters. This can make a multiple-times difference between accessing one 
partition directly and accessing it with partition key filters, especially on a 
network file system, and on a local file system when there are lots of 
partitions, e.g. 1/10th of a second vs. seconds.

Attached is Python code to create an example dataframe and save parquet 
datasets with different hive partition structures (/y=/m=/d=, or /y=/m=, or 
/dk=), and to read the datasets with/without filters to reproduce the issue. 
Observe the run time, and the directories and files accessed by the process in 
Process Monitor on Windows.

In the three partition structures, I saw in Process Monitor that all 
directories are accessed regardless of use_legacy_dataset=True or False. 
When use_legacy_dataset=False, the parquet files in all directories were 
opened.
The argument validate_schema=False made a small time difference, but the 
partition directories are still opened; it is only supported when 
use_legacy_dataset=True, and it is not supported/passed through by the pandas 
read_parquet wrapper API.

The /y=/m= layout is faster since there is no daily partition, so there are 
fewer directories and files.


There was another related Stack Overflow question and example 
[https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
and there was a comment on the partition discovery:
{quote}It should get discovered automatically. pd.read_parquet calls 
pyarrow.parquet.read_table and the default partitioning behavior should be to 
discover hive-style partitions (i.e. the ones you have). The fact that you have 
to specify this means that discovery is failing. If you could create a 
reproducible example and submit it to Arrow JIRA it would be helpful. 
– Pace  Feb 24 2021 at 18:55
{quote}
I wonder if there is a related Jira here already.
I tried passing in the partitioning argument; it didn't help. 
The versions of pyarrow used were 1.0.1, 5, and 7.


> reduce directory and file IO when reading partition parquet dataset
> ---
>
> Key: ARROW-15724
> URL: https://issues.apache.org/jira/browse/ARROW-15724
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Yin
>Priority: Major
> Attachments: pq.py
>
>
> Hi,
> It seems that Arrow accesses all partition directories (and even each 
> parquet file), i

[jira] [Created] (ARROW-15724) reduce directory and file IO when reading partition parquet dataset

2022-02-17 Thread Yin (Jira)
Yin created ARROW-15724:
---

 Summary: reduce directory and file IO when reading partition 
parquet dataset
 Key: ARROW-15724
 URL: https://issues.apache.org/jira/browse/ARROW-15724
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Yin
 Attachments: pq.py

Hi,
It seems Arrow accesses all partition directories (and even the parquet files), 
including those that clearly do not match the partition key values in the 
filters. This can make a multiple-times difference between accessing one 
partition directly and accessing it with partition key filters, especially on a 
network file system, and on a local file system when there are lots of 
partitions, e.g. 1/10th of a second vs. seconds.

Attached is Python code to create an example dataframe and save parquet 
datasets with different hive partition structures (/y=/m=/d=, or /y=/m=, or 
/dk=), and to read the datasets with/without filters to reproduce the issue. 
Observe the run time, and the directories and files accessed by the process in 
Process Monitor on Windows.

In the three partition structures, I saw in Process Monitor that all 
directories are accessed regardless of use_legacy_dataset=True or False. 
When use_legacy_dataset=False, the parquet files in all directories were 
opened.
The argument validate_schema=False made a small time difference, but the 
partition directories are still opened; it is only supported when 
use_legacy_dataset=True, and it is not supported/passed through by the pandas 
read_parquet wrapper API.

The /y=/m= layout is faster since there is no daily partition, so there are 
fewer directories and files.


There was another related Stack Overflow question and example 
[https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
and there was a comment on the partition discovery:
{quote}It should get discovered automatically. pd.read_parquet calls 
pyarrow.parquet.read_table and the default partitioning behavior should be to 
discover hive-style partitions (i.e. the ones you have). The fact that you have 
to specify this means that discovery is failing. If you could create a 
reproducible example and submit it to Arrow JIRA it would be helpful. 
– Pace  Feb 24 2021 at 18:55
{quote}
I wonder if there is a related Jira here already.
I tried passing in the partitioning argument; it didn't help. 
The versions of pyarrow used were 1.0.1, 5, and 7.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)