mimoune djouallah created ARROW-17679:
-----------------------------------------
             Summary: slow performance when reading data from GCP
                 Key: ARROW-17679
                 URL: https://issues.apache.org/jira/browse/ARROW-17679
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet, Python
    Affects Versions: 9.0.0
            Reporter: mimoune djouallah

I am using pyarrow and duckdb to query some parquet files in GCP. Thanks for making the experience so smooth, but I have an issue with the performance; see the code used:

import pyarrow.dataset as ds
import duckdb
import json

# Register three datasets stored in Google Cloud Storage.
lineitem = ds.dataset("gs://duckddelta/lineitem")
lineitem_partition = ds.dataset("gs://duckddelta/delta2", format="parquet", partitioning="hive")
lineitem_180 = ds.dataset("gs://duckddelta/lineitem_180", format="parquet", partitioning="hive")

con = duckdb.connect()
con.register("lineitem", lineitem)
con.register("lineitem_partition", lineitem_partition)
con.register("lineitem_180", lineitem_180)

def Query(request):
    SQL = request.get_json().get('name')
    df = con.execute(SQL).df()
    return json.dumps(df.to_json(orient="records")), 200, {'Content-Type': 'application/json'}

The issue is that I am getting extremely slow throughput, around 30 MB per second, while the same files read from a local SSD on my laptop are extremely fast. I am not sure what the issue is; I tried running the query with pyarrow compute instead, and the performance is the same.
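One thing worth checking (a minimal sketch, not verified against this bucket): enabling Parquet pre-buffering via ParquetFragmentScanOptions, so column chunks are fetched with larger coalesced range requests instead of many small reads. Small random reads are usually the dominant cost against object stores like GCS, which would explain why the same files are fast on a local SSD:

import pyarrow.dataset as ds

# Build a Parquet format object with pre-buffering enabled, so each
# fragment issues fewer, larger GETs against GCS.
parquet_format = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True)
)

# Same dataset as above, but scanned with the tuned format.
lineitem = ds.dataset("gs://duckddelta/lineitem", format=parquet_format)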
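To narrow down whether the bottleneck is the GCS reads themselves or the duckdb layer, a rough micro-benchmark along these lines (hypothetical numbers, same bucket as above) would help: time a full scan of one dataset with pyarrow alone and compute MB/s from the in-memory table size. If this also comes out around 30 MB per second, the query engine is not at fault:

import time
import pyarrow.dataset as ds

# Scan the whole dataset straight into memory, bypassing duckdb.
start = time.time()
table = ds.dataset("gs://duckddelta/lineitem").to_table()
elapsed = time.time() - start

# Approximate throughput based on the decoded in-memory size.
print(f"{table.nbytes / elapsed / 1e6:.1f} MB/s over {elapsed:.1f}s")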