mimoune djouallah created ARROW-17679:
-----------------------------------------

             Summary: slow performance when reading data from GCP
                 Key: ARROW-17679
                 URL: https://issues.apache.org/jira/browse/ARROW-17679
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet, Python
    Affects Versions: 9.0.0
            Reporter: mimoune djouallah


I am using PyArrow and DuckDB to query some Parquet files in GCP; thanks for
making the experience so smooth. However, I have an issue with performance.
See the code used below:
import pyarrow.dataset as ds
import duckdb
import json

# Register three GCS-backed Parquet datasets with DuckDB
lineitem = ds.dataset("gs://duckddelta/lineitem")
lineitem_partition = ds.dataset("gs://duckddelta/delta2", format="parquet", partitioning="hive")
lineitem_180 = ds.dataset("gs://duckddelta/lineitem_180", format="parquet", partitioning="hive")

con = duckdb.connect()
con.register("lineitem", lineitem)
con.register("lineitem_partition", lineitem_partition)
con.register("lineitem_180", lineitem_180)

# Cloud Function entry point: run the SQL passed in the request body
def Query(request):
    SQL = request.get_json().get('name')
    df = con.execute(SQL).df()
    return json.dumps(df.to_json(orient="records")), 200, {'Content-Type': 'application/json'}
 
The issue is that I am getting extremely slow throughput, around 30 MB per
second, while the same files on a local SSD laptop read extremely fast.
I am not sure what the issue is; I tried running the query with PyArrow compute
instead of DuckDB and the performance is the same.
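For reference, a minimal timing sketch to isolate the PyArrow read path from DuckDB (assuming the same gs://duckddelta/lineitem bucket, and that materializing the dataset with to_table() exercises the same GCS reads; the throughput estimate is based on the in-memory Arrow size, so it is only indicative):

import time
import pyarrow.dataset as ds

# Read the whole dataset directly with PyArrow, bypassing DuckDB
start = time.time()
table = ds.dataset("gs://duckddelta/lineitem").to_table()
elapsed = time.time() - start

# Rough throughput estimate from the in-memory Arrow size
mb = table.nbytes / (1024 * 1024)
print(f"read {mb:.0f} MB in {elapsed:.1f} s -> {mb / elapsed:.1f} MB/s")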


