This is an automated email from the ASF dual-hosted git repository.

hongze pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-gluten.git


The following commit(s) were added to refs/heads/main by this push:
     new 7a7b93cd9f [VL] Example workload for benchmarking Gluten + Delta on 
TPC-DS datasets (#10614)
7a7b93cd9f is described below

commit 7a7b93cd9fa0e3bb6ee08b4826cda787f4c46ac9
Author: Hongze Zhang <[email protected]>
AuthorDate: Mon Sep 15 11:58:21 2025 +0200

    [VL] Example workload for benchmarking Gluten + Delta on TPC-DS datasets 
(#10614)
---
 tools/workload/tpcds-delta/README.md               | 45 ++++++++++
 .../tpcds-delta/gen_data/tpcds_datagen_delta.sh    | 38 +++++++++
 tools/workload/tpcds-delta/run_tpcds/run_tpcds.sh  | 47 +++++++++++
 .../tpcds-delta/run_tpcds/tpcds_delta.scala        | 97 ++++++++++++++++++++++
 4 files changed, 227 insertions(+)

diff --git a/tools/workload/tpcds-delta/README.md 
b/tools/workload/tpcds-delta/README.md
new file mode 100644
index 0000000000..8fa4509cba
--- /dev/null
+++ b/tools/workload/tpcds-delta/README.md
@@ -0,0 +1,45 @@
+# Test on Velox backend with Delta Lake and TPC-DS workload
+
+This workload example is verified with JDK 8, Spark 3.4.4 and Delta 2.4.0.
+
+## Test dataset
+
+Use the bash script `tpcds_datagen_delta.sh` to generate the data. The script
relies on an already-built gluten-it
+executable. To build it, follow these steps:
+
+```bash
+cd ${GLUTEN_HOME}/tools/gluten-it/
+mvn clean install -P spark-3.4,delta
+```
+
+Then call the data generator script:
+
+```bash
+cd ${GLUTEN_HOME}/tools/workload/tpcds-delta/gen_data
+./tpcds_datagen_delta.sh
+```
+
+The command-line options used in the script are explained as follows:
+
+- `--benchmark-type=ds`:  "ds" for TPC-DS, "h" for TPC-H.
+- `--threads=112`: The parallelism. Ideally, set this to the number of CPU cores.
+- `-s=100`: The scale factor.
+- `--data-dir=/tmp/my-data`: The target table folder. If it doesn't exist, it 
will then be created.
+
+When the command is finished, check the data folder:
+
+```bash
+ls -l /tmp/my-data/
+```
+
+You should see a generated table folder in it:
+
+```bash
+drwxr-xr-x 20 root root 4096 Sep  1 15:13 
tpcds-generated-100.0-delta-partitioned
+```
+
+The folder `tpcds-generated-100.0-delta-partitioned` is the generated Delta 
TPC-DS table. As shown by the folder name, it's partitioned, and with scale 
factor 100.0.
+
+## Test Queries
+We provide the test queries in [TPC-DS 
Queries](../../../tools/gluten-it/common/src/main/resources/tpcds-queries).
+We provide a Scala script in [Run TPC-DS](./run_tpcds) directory about how to 
run TPC-DS queries on the generated Delta tables.
diff --git a/tools/workload/tpcds-delta/gen_data/tpcds_datagen_delta.sh 
b/tools/workload/tpcds-delta/gen_data/tpcds_datagen_delta.sh
new file mode 100755
index 0000000000..cdd91eb5e1
--- /dev/null
+++ b/tools/workload/tpcds-delta/gen_data/tpcds_datagen_delta.sh
@@ -0,0 +1,38 @@
+#!/bin/bash
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
# Fail fast: abort on errors, unset variables, and failures inside pipelines;
# echo each command for easier debugging.
set -eux
set -o pipefail

# Path to the local Gluten checkout. Must be edited before running.
GLUTEN_HOME=/PATH_TO_GLUTEN_HOME

# 1. Switch to the gluten-it folder.
#    Quoted so the script survives a checkout path containing spaces.
cd "${GLUTEN_HOME}/tools/gluten-it/"
# 2. Set JAVA_HOME. For example, JDK 17.
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-arm64
# 3. Set JVM heap size. For example, 224G.
export GLUTEN_IT_JVM_ARGS="-Xmx224G"
# 4. Generate the tables into --data-dir as a partitioned Delta dataset.
sbin/gluten-it.sh \
  data-gen-only \
  --data-source=delta \
  --local \
  --benchmark-type=ds \
  --threads=112 \
  -s=100 \
  --gen-partitioned-data \
  --data-dir=/tmp/my-data
diff --git a/tools/workload/tpcds-delta/run_tpcds/run_tpcds.sh 
b/tools/workload/tpcds-delta/run_tpcds/run_tpcds.sh
new file mode 100755
index 0000000000..048e793d99
--- /dev/null
+++ b/tools/workload/tpcds-delta/run_tpcds/run_tpcds.sh
@@ -0,0 +1,47 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
# Fail fast: abort on errors, unset variables, and pipeline failures;
# echo each command for easier debugging.
set -eux
set -o pipefail

# Paths to the Gluten bundle jar, the Delta jars, and the Spark distribution.
# All three must be edited before running.
GLUTEN_JAR=/PATH_TO_GLUTEN_HOME/package/target/<gluten-jar>
DELTA_JARS=/PATHS_TO_DELTA_JARS
SPARK_HOME=/PATH_TO_SPARK_HOME/

# Feed the Scala driver script into spark-shell via input redirection
# (avoids a needless `cat | ...` pipeline). Variable expansions are quoted
# so paths containing spaces do not get word-split.
"${SPARK_HOME}/bin/spark-shell" \
  --master yarn --deploy-mode client \
  --packages io.delta:delta-core_2.12:2.4.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.driver.extraClassPath="${GLUTEN_JAR}:${DELTA_JARS}" \
  --conf spark.executor.extraClassPath="${GLUTEN_JAR}:${DELTA_JARS}" \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  --conf spark.gluten.sql.columnar.forceShuffledHashJoin=true \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --num-executors 3 \
  --executor-cores 3 \
  --driver-memory 2g \
  --executor-memory 2g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.driver.maxResultSize=2g \
  < tpcds_delta.scala

# If there are some "*.so" libs dependencies issues on some specific Distros,
# try to enable spark.gluten.loadLibFromJar and build your own gluten-thirdparty-lib Jar.
# e.g.
#   --conf spark.gluten.loadLibFromJar=true \
#   --jars /PATH_TO_GLUTEN_HOME/package/target/thirdparty-lib/gluten-thirdparty-lib-ubuntu-22.04-x86_64.jar,
#          /PATH_TO_GLUTEN_HOME/package/target/gluten-velox-bundle-spark3.3_2.12-ubuntu_22.04_x86_64-1.x.x-SNAPSHOT.jar
diff --git a/tools/workload/tpcds-delta/run_tpcds/tpcds_delta.scala 
b/tools/workload/tpcds-delta/run_tpcds/tpcds_delta.scala
new file mode 100644
index 0000000000..8dd9f27ce5
--- /dev/null
+++ b/tools/workload/tpcds-delta/run_tpcds/tpcds_delta.scala
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import org.apache.spark.sql.execution.debug._
+import scala.io.Source
+import java.io.File
+import java.util.Arrays
+import sys.process._
+
// Configurations:
// Declared as immutable `val`s — nothing in this script reassigns them.

// Folder containing the generated TPC-DS Delta tables (output of
// tpcds_datagen_delta.sh), relative to `delta_file_root`.
val delta_table_path = "/PATH/TO/TPCDS_DELTA_TABLE_PATH"
// Root of the local Gluten checkout; query files are resolved under it.
val gluten_root = "/PATH/TO/GLUTEN"

// File root path: file://, hdfs:// , s3 , ...
// e.g. hdfs://hostname:8020
val delta_file_root = "/ROOT_PATH"

// Location of the bundled TPC-DS query files inside the Gluten checkout.
val tpcds_queries_path = "/tools/gluten-it/common/src/main/resources/tpcds-queries/"
+
/** Evaluates `block`, prints its wall-clock duration in seconds, and
  * returns the block's result unchanged. The parameter is by-name so
  * evaluation happens inside the timed region.
  */
def time[R](block: => R): R = {
  val start = System.nanoTime()
  val result = block
  val elapsedSeconds = (System.nanoTime() - start) / 1000000000.0
  println("Elapsed time: " + elapsedSeconds + " seconds")
  result
}
+
// Create TPC-DS Delta Tables.
// Register every generated table folder in the Spark catalog; each table
// lives under <delta_file_root><delta_table_path>/<table_name>.
val tpcdsTableNames = Seq(
  "call_center", "catalog_page", "catalog_returns", "catalog_sales",
  "customer", "customer_address", "customer_demographics", "date_dim",
  "household_demographics", "income_band", "inventory", "item",
  "promotion", "reason", "ship_mode", "store",
  "store_returns", "store_sales", "time_dim", "warehouse",
  "web_page", "web_returns", "web_sales", "web_site")
tpcdsTableNames.foreach { tableName =>
  spark.catalog.createTable(
    tableName, delta_file_root + delta_table_path + "/" + tableName, "delta")
}
+
/** Returns the plain files directly under `dir`, or an empty list when
  * `dir` does not exist or is not a directory. Subdirectories are skipped.
  */
def getListOfFiles(dir: String): List[File] = {
  val directory = new File(dir)
  if (!directory.exists || !directory.isDirectory) {
    Nil
  } else {
    // You can run a specific query by using below line.
    // directory.listFiles.filter(_.isFile).filter(_.getName().contains("17.sql")).toList
    directory.listFiles.filter(_.isFile).toList
  }
}
// Collect the query files and order them numerically rather than
// lexicographically: "q14a.sql" maps to 14.1 and "q14b.sql" to 14.2, so
// queries execute in q1, q2, ..., q99 order.
val fileLists = getListOfFiles(gluten_root + tpcds_queries_path)
val sorted = fileLists.sortBy { queryFile =>
  queryFile.getName
    .replaceFirst("a", ".1")
    .replaceFirst("b", ".2")
    .replaceFirst(".sql", "")
    .replaceFirst("q", "")
    .toDouble
}
+
// Main program to run TPC-DS testing.
// For each query file: read it, drop "--" comment lines, run it through
// Spark, and report the elapsed time. The Source is now closed in a
// `finally` block — the original leaked one file handle per query.
for (t <- sorted) {
  println(t)
  val source = Source.fromFile(t)
  val fileContents =
    try source.getLines.filter(!_.startsWith("--")).mkString(" ")
    finally source.close()
  println(fileContents)
  time{spark.sql(fileContents).collectAsList()}
  // spark.sql(fileContents).explain
  // Short pause between queries so per-query output/metrics stay separable.
  Thread.sleep(2000)
}


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to