This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git


The following commit(s) were added to refs/heads/master by this push:
     new 58031aa0 Add Python script to run benchmarks (#302)
58031aa0 is described below

commit 58031aa02e8226681df91b50573b21190a6ce5df
Author: Andy Grove <[email protected]>
AuthorDate: Sat Oct 8 10:59:59 2022 -0600

    Add Python script to run benchmarks (#302)
---
 benchmarks/README.md | 22 +++++++++++++++++++-
 benchmarks/tpch.py   | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/benchmarks/README.md b/benchmarks/README.md
index bd32f415..6b0acbaa 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -39,7 +39,27 @@ generator.
 Data will be generated into the `data` subdirectory and will not be checked in because this directory has been added to the `.gitignore` file.
 
-## Running the DataFusion Benchmarks
+## Running the DataFusion Benchmarks in Python
+
+Build the Python bindings and then run:
+
+```bash
+$ python tpch.py --query q1 --path /mnt/bigdata/tpch/sf1-parquet/ 
+Registering table part at path /mnt/bigdata/tpch/sf1-parquet//part
+Registering table supplier at path /mnt/bigdata/tpch/sf1-parquet//supplier
+Registering table partsupp at path /mnt/bigdata/tpch/sf1-parquet//partsupp
+Registering table customer at path /mnt/bigdata/tpch/sf1-parquet//customer
+Registering table orders at path /mnt/bigdata/tpch/sf1-parquet//orders
+Registering table lineitem at path /mnt/bigdata/tpch/sf1-parquet//lineitem
+Registering table nation at path /mnt/bigdata/tpch/sf1-parquet//nation
+Registering table region at path /mnt/bigdata/tpch/sf1-parquet//region
+Query q1 took 9.668351173400879 second(s)
+```
+
+Note that this Python script currently only supports running against file formats that contain a schema definition (such as Parquet).
+
+## Running the DataFusion Benchmarks in Rust
 
 The benchmark can then be run (assuming the data created from `dbgen` is in `./data`) with a command such as:
 
diff --git a/benchmarks/tpch.py b/benchmarks/tpch.py
new file mode 100644
index 00000000..946e47e7
--- /dev/null
+++ b/benchmarks/tpch.py
@@ -0,0 +1,58 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import time
+import argparse
+
+parser = argparse.ArgumentParser(description='Run SQL benchmarks.')
+parser.add_argument('--query', help='query to run, such as q1')
+parser.add_argument('--path', help='path to data files')
+parser.add_argument('--ext', default='', help='optional file extension, such as parquet')
+
+args = parser.parse_args()
+
+query = args.query
+path = args.path
+table_ext = args.ext
+
+import ballista
+ctx = ballista.BallistaContext("localhost", 50050)
+
+tables = ["part", "supplier", "partsupp", "customer", "orders", "lineitem", "nation", "region"]
+
+for table in tables:
+    table_path = path + "/" + table
+    if len(table_ext) > 0:
+        table_path = table_path + "." + table_ext
+    print("Registering table", table, "at path", table_path)
+    ctx.register_parquet(table, table_path)
+
+with open("queries/" + query + ".sql", 'r') as file:
+    sql = file.read()
+
+start = time.time()
+
+df = ctx.sql(sql)
+df.show()
+
+end = time.time()
+print("Query", query, "took", end - start, "second(s)")
+
+
