This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new 66a5bca66 Publish built docs triggered by e2383921f714c857c31bbbc1a2f427bb0608b46c
66a5bca66 is described below

commit 66a5bca66b9f9eebe97186889a012e025a490a46
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AuthorDate: Wed Apr 9 22:58:01 2025 +0000

    Publish built docs triggered by e2383921f714c857c31bbbc1a2f427bb0608b46c
---
 _sources/contributor-guide/benchmarking.md.txt     |   4 +
 .../contributor-guide/benchmarking_aws_ec2.md.txt  | 223 ++++++++
 contributor-guide/benchmarking.html                |   4 +
 contributor-guide/benchmarking_aws_ec2.html        | 567 +++++++++++++++++++++
 objects.inv                                        | Bin 786 -> 807 bytes
 searchindex.js                                     |   2 +-
 6 files changed, 799 insertions(+), 1 deletion(-)

diff --git a/_sources/contributor-guide/benchmarking.md.txt b/_sources/contributor-guide/benchmarking.md.txt
index 1193ada62..15934d7f5 100644
--- a/_sources/contributor-guide/benchmarking.md.txt
+++ b/_sources/contributor-guide/benchmarking.md.txt
@@ -22,6 +22,10 @@ under the License.
 To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS.
 Data generation and benchmarking documentation and scripts are available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository.
 
+Available benchmarking guides:
+
+- [Benchmarking on AWS EC2](benchmarking_aws_ec2)
+
 We also have many micro benchmarks that can be run from an IDE located [here](https://github.com/apache/datafusion-comet/tree/main/spark/src/test/scala/org/apache/spark/sql/benchmark).
 
 ## Current Benchmark Results
diff --git a/_sources/contributor-guide/benchmarking_aws_ec2.md.txt b/_sources/contributor-guide/benchmarking_aws_ec2.md.txt
new file mode 100644
index 000000000..0ec33bf7e
--- /dev/null
+++ b/_sources/contributor-guide/benchmarking_aws_ec2.md.txt
@@ -0,0 +1,223 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Comet Benchmarking in AWS
+
+This guide explains how to set up benchmarks on a single AWS EC2 node, with Parquet files
+located in S3.
+
+## Data Generation
+
+- Create an EC2 instance with an EBS volume sized to approximately 2x the size of
+  the dataset to be generated (200 GB for scale factor 100, 2 TB for scale factor 1000, and so on)
+- Create an S3 bucket to store the Parquet files (see the sketch below)
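+
+For example, the bucket can be created up front with the AWS CLI (a minimal sketch;
+`your-bucket-name` and the region are placeholders to replace with your own values):
+
+```shell
+# Create the bucket that will hold the generated Parquet files
+aws s3 mb s3://your-bucket-name --region us-east-1
+```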
+
+Install prerequisites:
+
+```shell
+sudo yum install -y docker git python3-pip
+
+sudo systemctl start docker
+sudo systemctl enable docker
+sudo usermod -aG docker ec2-user
+newgrp docker
+
+docker pull ghcr.io/scalytics/tpch-docker:main
+
+pip3 install datafusion
+```
+
+Run the data generation script:
+
+```shell
+git clone https://github.com/apache/datafusion-benchmarks.git
+cd datafusion-benchmarks/tpch
+nohup python3 tpchgen.py generate --scale-factor 100 --partitions 16 &
+```
+
+Check on progress with the following commands:
+
+```shell
+docker ps
+du -h -d 1 data
+```
+
+Fix ownership of the generated files:
+
+```shell
+sudo chown -R ec2-user:docker data
+```
+
+Convert to Parquet:
+
+```shell
+nohup python3 tpchgen.py convert --scale-factor 100 --partitions 16 &
+```
+
+Delete the raw CSV (`.tbl`) files:
+
+```shell
+cd data
+rm *.tbl.*
+```
+
+Copy the Parquet files to S3:
+
+```shell
+aws s3 cp . s3://your-bucket-name/top-level-folder/ --recursive
+```
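+
+Optionally verify the upload before moving on (a quick check; the summarized totals should
+roughly match the local `du` output above):
+
+```shell
+aws s3 ls s3://your-bucket-name/top-level-folder/ --recursive --human-readable --summarize
+```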
+
+## Install Spark
+
+Install Java:
+
+```shell
+sudo yum install -y java-17-amazon-corretto-headless java-17-amazon-corretto-devel
+```
+
+Set `JAVA_HOME`:
+
+```shell
+export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto
+```
+
+Install Spark:
+
+```shell
+wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
+tar xzf spark-3.5.4-bin-hadoop3.tgz
+sudo mv spark-3.5.4-bin-hadoop3 /opt
+export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3/
+mkdir /tmp/spark-events
+```
+
+Set the `SPARK_MASTER` env var (edit the IP address to match your instance):
+
+```shell
+export SPARK_MASTER=spark://172.31.34.87:7077
+```
+
+Set `SPARK_LOCAL_DIRS` to point to the EBS volume:
+
+```shell
+sudo mkdir /mnt/tmp
+sudo chmod 777 /mnt/tmp
+mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
+```
+
+Add the following entry to `spark-env.sh`:
+
+```shell
+SPARK_LOCAL_DIRS=/mnt/tmp
+```
+
+Start Spark in standalone mode:
+
+```shell
+$SPARK_HOME/sbin/start-master.sh
+$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
+```
+
+Install the Hadoop jar files needed for S3 access:
+
+```shell
+wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -P $SPARK_HOME/jars
+wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar -P $SPARK_HOME/jars
+```
+
+Add credentials to `~/.aws/credentials`:
+
+```shell
+[default]
+aws_access_key_id=your-access-key
+aws_secret_access_key=your-secret-key
+```
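+
+At this point it can be useful to confirm that the worker registered with the master
+(a sketch, assuming the default standalone web UI port of 8080):
+
+```shell
+# The master web UI lists registered workers and their state (e.g. ALIVE)
+curl -s http://localhost:8080 | grep -i alive
+```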
+
+## Run Spark Benchmarks
+
+Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):
+
+```shell
+$SPARK_HOME/bin/spark-submit \
+    --master $SPARK_MASTER \
+    --conf spark.driver.memory=4G \
+    --conf spark.executor.instances=1 \
+    --conf spark.executor.cores=8 \
+    --conf spark.cores.max=8 \
+    --conf spark.executor.memory=16g \
+    --conf spark.eventLog.enabled=false \
+    --conf spark.local.dir=/mnt/tmp \
+    --conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
+    --conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
+    tpcbench.py \
+    --benchmark tpch \
+    --data s3a://your-bucket-name/top-level-folder \
+    --queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
+    --output . \
+    --iterations 1
+```
+
+## Run Comet Benchmarks
+
+Install the Comet JAR from Maven:
+
+```shell
+wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.7.0/comet-spark-spark3.5_2.12-0.7.0.jar -P $SPARK_HOME/jars
+export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.7.0.jar
+```
+
+Run the following command (the `--data` parameter will need to be updated to point to your S3 bucket):
+
+```shell
+$SPARK_HOME/bin/spark-submit \
+    --master $SPARK_MASTER \
+    --conf spark.driver.memory=4G \
+    --conf spark.executor.instances=1 \
+    --conf spark.executor.cores=8 \
+    --conf spark.cores.max=8 \
+    --conf spark.executor.memory=16g \
+    --conf spark.memory.offHeap.enabled=true \
+    --conf spark.memory.offHeap.size=16g \
+    --conf spark.eventLog.enabled=false \
+    --conf spark.local.dir=/mnt/tmp \
+    --conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
+    --conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
+    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
+    --jars $COMET_JAR \
+    --driver-class-path $COMET_JAR \
+    --conf spark.driver.extraClassPath=$COMET_JAR \
+    --conf spark.executor.extraClassPath=$COMET_JAR \
+    --conf spark.plugins=org.apache.spark.CometPlugin \
+    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+    --conf spark.comet.enabled=true \
+    --conf spark.comet.cast.allowIncompatible=true \
+    --conf spark.comet.exec.replaceSortMergeJoin=true \
+    --conf spark.comet.exec.shuffle.enabled=true \
+    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
+    --conf spark.comet.exec.shuffle.compression.codec=lz4 \
+    --conf spark.comet.exec.shuffle.compression.level=1 \
+    tpcbench.py \
+    --benchmark tpch \
+    --data s3a://your-bucket-name/top-level-folder \
+    --queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
+    --output . \
+    --iterations 1
+```
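+
+Once both runs complete, the result files written to the `--output` directory can be copied
+back to S3 so that they survive instance termination (a sketch, assuming `tpcbench.py`
+writes its timings as JSON files into that directory):
+
+```shell
+aws s3 cp . s3://your-bucket-name/results/ --recursive --exclude "*" --include "*.json"
+```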
class="w"> </span><span class="se">\</span> +<span class="w"> </span>--iterations<span class="w"> </span><span class="m">1</span> +</pre></div> +</div> +</section> +</section> + + + </div> + + + <!-- Previous / next buttons --> +<div class='prev-next-area'> +</div> + + </main> + + + </div> + </div> + + <script src="../_static/scripts/pydata-sphinx-theme.js?digest=1999514e3f237ded88cf"></script> + +<!-- Based on pydata_sphinx_theme/footer.html --> +<footer class="footer mt-5 mt-md-0"> + <div class="container"> + + <div class="footer-item"> + <p class="copyright"> + © Copyright 2023-2024, Apache Software Foundation.<br> +</p> + </div> + + <div class="footer-item"> + <p class="sphinx-version"> +Created using <a href="http://sphinx-doc.org/">Sphinx</a> 8.1.3.<br> +</p> + </div> + + <div class="footer-item"> + <p>Apache DataFusion, Apache DataFusion Comet, Apache, the Apache feather logo, and the Apache DataFusion project logo</p> + <p>are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p> + </div> + </div> +</footer> + + + </body> +</html> \ No newline at end of file diff --git a/objects.inv b/objects.inv index ff80537dc..990ca2ba0 100644 Binary files a/objects.inv and b/objects.inv differ diff --git a/searchindex.js b/searchindex.js index 0a194632f..35b3f52da 100644 --- a/searchindex.js +++ b/searchindex.js @@ -1 +1 @@ -Search.setIndex({"alltitles": {"1. Install Comet": [[9, "install-comet"]], "2. Clone Spark and Apply Diff": [[9, "clone-spark-and-apply-diff"]], "3. Run Spark SQL Tests": [[9, "run-spark-sql-tests"]], "ANSI mode": [[11, "ansi-mode"]], "API Differences Between Spark Versions": [[0, "api-differences-between-spark-versions"]], "ASF Links": [[10, null]], "Adding Spark-side Tests for the New Expression": [[0, "adding-spark-side-tests-for-the-new-expression"]], "Adding a New Expression": [[0, [...] \ No newline at end of file +Search.setIndex({"alltitles": {"1. Install Comet": [[10, "install-comet"]], "2. Clone Spark and Apply Diff": [[10, "clone-spark-and-apply-diff"]], "3. Run Spark SQL Tests": [[10, "run-spark-sql-tests"]], "ANSI mode": [[12, "ansi-mode"]], "API Differences Between Spark Versions": [[0, "api-differences-between-spark-versions"]], "ASF Links": [[11, null]], "Adding Spark-side Tests for the New Expression": [[0, "adding-spark-side-tests-for-the-new-expression"]], "Adding a New Expression": [[ [...] \ No newline at end of file --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org For additional commands, e-mail: commits-h...@datafusion.apache.org