[jira] [Commented] (ARROW-16391) pd.read_parquet using filters consuming too much memory

Weston Pace (Jira) Wed, 27 Apr 2022 20:44:09 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529175#comment-17529175
 ]


Weston Pace commented on ARROW-16391:
-------------------------------------

We've worked on this as part of 8.0.0 (see ARROW-15410).  Are you willing to 
try the 8.0.0 release candidate to see if you get any different behavior?  It 
looks like you are installing pyarrow via pip so I think the latest RC is 
available here: https://apache.jfrog.io/ui/native/arrow/python-rc/8.0.0-rc1



> pd.read_parquet using filters consuming too much memory
> -------------------------------------------------------
>
>                 Key: ARROW-16391
>                 URL: https://issues.apache.org/jira/browse/ARROW-16391
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 4.0.1, 7.0.0
>         Environment:  Hardware Overview:
>       Model Name: MacBook Pro
>       Model Identifier: MacBookPro12,1
>       Processor Name: Dual-Core Intel Core i5
>       Processor Speed: 2.7 GHz
>       Number of Processors: 1
>       Total Number of Cores: 2
>       L2 Cache (per Core): 256 KB
>       L3 Cache: 3 MB
>       Hyper-Threading Technology: Enabled
>       Memory: 8 GB
>  System Software Overview:
>       System Version: macOS 10.15.7 (19H1217)
>       Kernel Version: Darwin 19.6.0
>       Boot Volume: Macintosh HD
>       Boot Mode: Normal
>            Reporter: Luis E Pastrana
>            Priority: Major
>
> Hello!
>  
> I have found that pyarrow versions *>= 4.0.1* use more than *2x* memory (RSS) 
> when trying to read a parquet using file-level filters. Using the following 
> dataset:
> {code:java}
> import pandas as pd
> import numpy as np
> a = np.random.randint(1,50,(4_000_000,4))
> df = pd.DataFrame(a, columns=['A','B','C','D']).to_parquet("test.pq", 
> index=False) {code}
> and the reader script ({*}read_with_filters.py{*})
> {code:java}
> import pyarrow as pa
> import pandas as pd
> print(f"pyarrow version: {pa.__version__}")
> print(f"pandas version: {pd.__version__}")
> tmp = pd.read_parquet("test.pq", engine='pyarrow', use_legacy_dataset=False, 
> filters=[("B","=",10)])
> print(tmp.shape) {code}
> I get:
>  
> *Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 1.0.1
> pandas version: 1.4.2
> (81833, 4)
> RSS (Kb): 84876 | user (sec): 0.87 | system (sec): 0.32 | real (sec) : 
> 1.32{code}
> *Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip)*
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 4.0.1
> pandas version: 1.4.2
> (81833, 4)
> RSS (Kb): 172816 | user (sec): 0.77 | system (sec): 0.24 | real (sec) : 0.72 
> {code}
> *Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip)*
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 7.0.0
> pandas version: 1.4.2
> (81833, 4)
> RSS (Kb): 240112 | user (sec): 0.71 | system (sec): 0.22 | real (sec) : 0.82 
> {code}
>  
> It is more evident when using larger dataset. However, my personal computer 
> hangs when trying to read larger datasets using *4.0.1* and {*}7.0.0{*}. 
> (That should be a separate issue)
>  
> It is worth mentioning that you see a relative the same memory usage when I 
> removed the *filters* keyword.
> *Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 1.0.1
> pandas version: 1.4.2
> (4000000, 4)
> RSS (Kb): 331424 | user (sec): 0.89 | system (sec): 0.39 | real (sec) : 1.07 
> {code}
> {*}Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip){*}{*}{{*}}
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 4.0.1
> pandas version: 1.4.2
> (4000000, 4)
> RSS (Kb): 405916 | user (sec): 0.81 | system (sec): 0.42 | real (sec) : 0.81 
> {code}
> {*}Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip){*}{*}{{*}}
> {code:java}
> gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" 
> python read_with_filters.py
> pyarrow version: 7.0.0
> pandas version: 1.4.2
> (4000000, 4)
> RSS (Kb): 364152 | user (sec): 0.78 | system (sec): 0.45 | real (sec) : 1.27 
> {code}
> Thank you,
> Luis



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (ARROW-16391) pd.read_parquet using filters consuming too much memory

Reply via email to