Luis E Pastrana created ARROW-16391:
---------------------------------------
Summary: pd.read_parquet using filters consumes too much memory
Key: ARROW-16391
URL: https://issues.apache.org/jira/browse/ARROW-16391
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 7.0.0, 4.0.1
Environment:
Hardware Overview:
  Model Name: MacBook Pro
  Model Identifier: MacBookPro12,1
  Processor Name: Dual-Core Intel Core i5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 2
  L2 Cache (per Core): 256 KB
  L3 Cache: 3 MB
  Hyper-Threading Technology: Enabled
  Memory: 8 GB
System Software Overview:
  System Version: macOS 10.15.7 (19H1217)
  Kernel Version: Darwin 19.6.0
  Boot Volume: Macintosh HD
  Boot Mode: Normal
Reporter: Luis E Pastrana

Hello! I have found that pyarrow versions *>= 4.0.1* use more than *2x* the memory (RSS) when reading a parquet file with file-level filters.

Using the following dataset:
{code:python}
import pandas as pd
import numpy as np

# 4 million rows of random integers in [1, 50)
a = np.random.randint(1, 50, (4_000_000, 4))
pd.DataFrame(a, columns=['A', 'B', 'C', 'D']).to_parquet("test.pq", index=False)
{code}
and the reader script ({*}read_with_filters.py{*}):
{code:python}
import pyarrow as pa
import pandas as pd

print(f"pyarrow version: {pa.__version__}")
print(f"pandas version: {pd.__version__}")

tmp = pd.read_parquet("test.pq", engine='pyarrow',
                      use_legacy_dataset=False,
                      filters=[("B", "=", 10)])
print(tmp.shape)
{code}
I get:

*Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 1.0.1
pandas version: 1.4.2
(81833, 4)
RSS (Kb): 84876 | user (sec): 0.87 | system (sec): 0.32 | real (sec) : 1.32
{code}
*Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 4.0.1
pandas version: 1.4.2
(81833, 4)
RSS (Kb): 172816 | user (sec): 0.77 | system (sec): 0.24 | real (sec) : 0.72
{code}
*Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 7.0.0
pandas version: 1.4.2
(81833, 4)
RSS (Kb): 240112 | user (sec): 0.71 | system (sec): 0.22 | real (sec) : 0.82
{code}
The difference is more evident with larger datasets; however, my personal computer hangs when trying to read larger datasets with *4.0.1* and *7.0.0*. (That should be a separate issue.)

It is worth mentioning that memory usage is roughly the same across versions when the *filters* keyword is removed:

*Python 3.8.13 (conda), pyarrow 1.0.1 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 1.0.1
pandas version: 1.4.2
(4000000, 4)
RSS (Kb): 331424 | user (sec): 0.89 | system (sec): 0.39 | real (sec) : 1.07
{code}
*Python 3.8.13 (conda), pyarrow 4.0.1 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 4.0.1
pandas version: 1.4.2
(4000000, 4)
RSS (Kb): 405916 | user (sec): 0.81 | system (sec): 0.42 | real (sec) : 0.81
{code}
*Python 3.8.13 (conda), pyarrow 7.0.0 (pip), pandas 1.4.2 (pip)*
{code}
gtime -f "RSS (Kb): %M | user (sec): %U | system (sec): %S | real (sec) : %e" python read_with_filters.py
pyarrow version: 7.0.0
pandas version: 1.4.2
(4000000, 4)
RSS (Kb): 364152 | user (sec): 0.78 | system (sec): 0.45 | real (sec) : 1.27
{code}
Thank you,
Luis

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
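As a cross-check on the gtime figures above, peak RSS can also be read from inside the process with the standard library's {{resource}} module, so no GNU time install is needed. This is a minimal sketch (stdlib only); note that {{ru_maxrss}} is reported in kilobytes on Linux but in bytes on macOS:
{code:python}
import resource
import sys

def peak_rss_kb() -> int:
    """Return this process's peak resident set size in kilobytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss units: bytes on macOS (darwin), kilobytes on Linux
    return rss // 1024 if sys.platform == "darwin" else rss

# Printed at the end of read_with_filters.py, this reports a peak-RSS
# number comparable to gtime's %M field.
print(f"RSS (Kb): {peak_rss_kb()}")
{code}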