Casey created ARROW-6985:
----------------------------

             Summary: Steadily increasing time to load file using read_parquet
                 Key: ARROW-6985
                 URL: https://issues.apache.org/jira/browse/ARROW-6985
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.15.0, 0.14.0, 0.13.0
            Reporter: Casey
             Fix For: 0.15.0, 0.14.0, 0.13.0

I've noticed that reading a Parquet file with the pandas read_parquet function takes steadily longer with each invocation. I've seen the other ticket about memory usage, but I see no memory impact, only a steadily increasing read time until I restart the Python session.

Below is some code to reproduce my results. The slowdown is particularly bad on wide matrices, especially with pyarrow==0.15.0.

{code:python}
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd
import os
import numpy as np
import time

file = "skinny_matrix.pq"

# Write the test file once: a mostly-zero matrix with every 100th entry random.
if not os.path.isfile(file):
    mat = np.zeros((6000, 26000))
    mat.ravel()[::100] = np.random.randn(60 * 26000)
    df = pd.DataFrame(mat.T)  # 26000 rows by 6000 columns
    table = pa.Table.from_pandas(df)
    pq.write_table(table, file)

# Time repeated reads of the same file within a single Python session.
n_timings = 50
timings = np.empty(n_timings)
for i in range(n_timings):
    start = time.time()
    new_df = pd.read_parquet(file)
    end = time.time()
    timings[i] = end - start
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
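
The reproduction script records each read's duration in `timings` but does not summarize them. Below is a minimal sketch of one way to put a number on "steadily longer", assuming a straight least-squares line is an adequate summary of the trend. The helper names `time_repeated_calls` and `slowdown_per_call` are hypothetical and are not part of pandas or pyarrow.

```python
import time


def time_repeated_calls(fn, n_calls=50):
    """Call fn() n_calls times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def slowdown_per_call(durations):
    """Least-squares slope of duration against call index (seconds per call).

    A clearly positive slope means later calls take longer than earlier ones;
    a slope near zero means the call time is roughly constant.
    """
    n = len(durations)
    if n < 2:
        raise ValueError("need at least two timings to fit a trend")
    mean_x = (n - 1) / 2.0          # mean of the call indices 0..n-1
    mean_y = sum(durations) / n
    cov = sum((i - mean_x) * (d - mean_y) for i, d in enumerate(durations))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var
```

With the script from the report, this could be applied either as `slowdown_per_call(timings)` on the recorded array, or as `slowdown_per_call(time_repeated_calls(lambda: pd.read_parquet(file)))` to collect and summarize in one step. The slope is only a rough indicator; plotting the raw timings would show whether the growth is actually linear.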