Luis E Pastrana created ARROW-16391:
---------------------------------------
Summary: pd.read_parquet using filters consumes too much memory
Key: ARROW-16391
URL: https://issues.apache.org/jira/browse/ARROW-16391
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 7.0.0, 4.0.1
Environment:
Hardware Overview:
  Model Name: MacBook Pro
  Model Identifier: MacBookPro12,1
  Processor Name: Dual-Core Intel Core i5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 2
  L2 Cache (per Core): 256 KB
  L3 Cache: 3 MB
  Hyper-Threading Technology: Enabled
  Memory: 8 GB
System Software Overview:
  System Version: macOS 10.15.7 (19H1217)
  Kernel Version: Darwin 19.6.0
  Boot Volume: Macintosh HD
  Boot Mode: Normal
Reporter: Luis E Pastrana

Hello! I have found that pyarrow versions *>= 4.0.1* use more than *2x* the memory (RSS) when reading a parquet file with file-level filters.

Using the following dataset:
{code:python}
import pandas as pd
import numpy as np

# 4 million rows of random integers in [1, 50)
a = np.random.randint(1, 50, (4_000_000, 4))
pd.DataFrame(a, columns=['A', 'B', 'C', 'D']).to_parquet("test.pq", index=False)
{code}
and the reader script ({*}read_with_filters.py{*}):
{code:python}
import pyarrow as pa
import pandas as pd

print(f"pyarrow version: {pa.__version__}")
print(f"pandas version: {pd.__version__}")

tmp = pd.read_parquet("test.pq", engine='pyarrow',
                      use_legacy_dataset=False,
                      filters=[("B", "=", 10)])
print(tmp.shape)
{code}
I get:

*Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 1.0.1
pandas version: 1.4.2
(81833, 4)
RSS (Kb): 84876 | user (sec): 0.87 | system (sec): 0.32 | real (sec) : 1.32
{code}
*Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 4.0.1
pandas version: 1.4.2
(81833, 4)
RSS (Kb): 172816 | user (sec): 0.77 | system (sec): 0.24 | real (sec) : 0.72
{code}
*Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 7.0.0
pandas version: 1.4.2
(81833, 4)
RSS (Kb): 240112 | user (sec): 0.71 | system (sec): 0.22 | real (sec) : 0.82
{code}
The difference is more evident with larger datasets; however, my personal computer hangs when trying to read larger datasets with *4.0.1* and *7.0.0*. (That should be a separate issue.)

It is worth mentioning that memory usage is roughly the same across versions when the *filters* keyword is removed:

*Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 1.0.1
pandas version: 1.4.2
(4000000, 4)
RSS (Kb): 331424 | user (sec): 0.89 | system (sec): 0.39 | real (sec) : 1.07
{code}
*Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 4.0.1
pandas version: 1.4.2
(4000000, 4)
RSS (Kb): 405916 | user (sec): 0.81 | system (sec): 0.42 | real (sec) : 0.81
{code}
*Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 7.0.0
pandas version: 1.4.2
(4000000, 4)
RSS (Kb): 364152 | user (sec): 0.78 | system (sec): 0.45 | real (sec) : 1.27
{code}
Thank you,
Luis

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
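As a cross-check on the gtime figures above, peak RSS can also be read from inside the process with the standard library's {{resource}} module, so no GNU time install is needed. This is a minimal sketch (stdlib only); note that {{ru_maxrss}} is reported in kilobytes on Linux but in bytes on macOS:
{code:python}
import resource
import sys

def peak_rss_kb() -> int:
    """Return this process's peak resident set size in kilobytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss units: bytes on macOS (darwin), kilobytes on Linux
    return rss // 1024 if sys.platform == "darwin" else rss

# Printed at the end of read_with_filters.py, this reports a peak-RSS
# number comparable to gtime's %M field.
print(f"RSS (Kb): {peak_rss_kb()}")
{code}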