[ https://issues.apache.org/jira/browse/ARROW-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529175#comment-17529175 ]
Weston Pace commented on ARROW-16391: ------------------------------------- We've worked on this as part of 8.0.0 (see ARROW-15410). Are you willing to try the 8.0.0 release candidate to see if you get any different behavior? It looks like you are installing pyarrow via pip so I think the latest RC is available here: https://apache.jfrog.io/ui/native/arrow/python-rc/8.0.0-rc1 > pd.read_parquet using filters consuming too much memory > ------------------------------------------------------- > > Key: ARROW-16391 > URL: https://issues.apache.org/jira/browse/ARROW-16391 > Project: Apache Arrow > Issue Type: Bug > Affects Versions: 4.0.1, 7.0.0 > Environment: Hardware Overview: > Model Name: MacBook Pro > Model Identifier: MacBookPro12,1 > Processor Name: Dual-Core Intel Core i5 > Processor Speed: 2.7 GHz > Number of Processors: 1 > Total Number of Cores: 2 > L2 Cache (per Core): 256 KB > L3 Cache: 3 MB > Hyper-Threading Technology: Enabled > Memory: 8 GB > System Software Overview: > System Version: macOS 10.15.7 (19H1217) > Kernel Version: Darwin 19.6.0 > Boot Volume: Macintosh HD > Boot Mode: Normal > Reporter: Luis E Pastrana > Priority: Major > > Hello! > > I have found that pyarrow versions *>= 4.0.1* use more than *2x* memory (RSS) > when trying to read a parquet using file-level filters. Using the following > dataset: > {code:java} > import pandas as pd > import numpy as np > a = np.random.randint(1,50,(4_000_000,4)) > df = pd.DataFrame(a, columns=['A','B','C','D']).to_parquet("test.pq", > index=False) {code} > and the reader script ({*}read_with_filters.py{*}) > {code:java} > import pyarrow as pa > import pandas as pd > print(f"pyarrow version: {pa.__version__}") > print(f"pandas version: {pd.__version__}") > tmp = pd.read_parquet("test.pq", engine='pyarrow', use_legacy_dataset=False, > filters=[("B","=",10)]) > print(tmp.shape) {code} > I get: > > *Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)* > {code:java} > gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" > python read_with_filters.py > pyarrow version: 1.0.1 > pandas version: 1.4.2 > (81833, 4) > RSS (Kb): 84876 | user (sec): 0.87 | system (sec): 0.32 | real (sec) : > 1.32{code} > *Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip)* > {code:java} > gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" > python read_with_filters.py > pyarrow version: 4.0.1 > pandas version: 1.4.2 > (81833, 4) > RSS (Kb): 172816 | user (sec): 0.77 | system (sec): 0.24 | real (sec) : 0.72 > {code} > *Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip)* > {code:java} > gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" > python read_with_filters.py > pyarrow version: 7.0.0 > pandas version: 1.4.2 > (81833, 4) > RSS (Kb): 240112 | user (sec): 0.71 | system (sec): 0.22 | real (sec) : 0.82 > {code} > > It is more evident when using larger dataset. However, my personal computer > hangs when trying to read larger datasets using *4.0.1* and {*}7.0.0{*}. > (That should be a separate issue) > > It is worth mentioning that you see a relative the same memory usage when I > removed the *filters* keyword. > *Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)* > {code:java} > gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" > python read_with_filters.py > pyarrow version: 1.0.1 > pandas version: 1.4.2 > (4000000, 4) > RSS (Kb): 331424 | user (sec): 0.89 | system (sec): 0.39 | real (sec) : 1.07 > {code} > {*}Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip){*}{*}{{*}} > {code:java} > gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" > python read_with_filters.py > pyarrow version: 4.0.1 > pandas version: 1.4.2 > (4000000, 4) > RSS (Kb): 405916 | user (sec): 0.81 | system (sec): 0.42 | real (sec) : 0.81 > {code} > {*}Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip){*}{*}{{*}} > {code:java} > gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" > python read_with_filters.py > pyarrow version: 7.0.0 > pandas version: 1.4.2 > (4000000, 4) > RSS (Kb): 364152 | user (sec): 0.78 | system (sec): 0.45 | real (sec) : 1.27 > {code} > Thank you, > Luis -- This message was sent by Atlassian Jira (v8.20.7#820007)