This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 1b212f5d Publish built docs triggered by
9dfd6d110b329d47a505a0fee73a9f3e898c4929
1b212f5d is described below
commit 1b212f5d3e4d8b436dbce84e96600fb904dd4781
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AuthorDate: Thu Sep 19 19:23:11 2024 +0000
Publish built docs triggered by 9dfd6d110b329d47a505a0fee73a9f3e898c4929
---
_sources/contributor-guide/plugin_overview.md.txt | 55 +++++++++++++----
_static/images/CometNativeExecution.drawio.png | Bin 0 -> 61017 bytes
_static/images/CometNativeParquetScan.drawio.png | Bin 0 -> 75703 bytes
contributor-guide/contributing.html | 4 +-
contributor-guide/development.html | 4 +-
contributor-guide/plugin_overview.html | 70 +++++++++++++++++-----
objects.inv | Bin 745 -> 751 bytes
searchindex.js | 2 +-
8 files changed, 103 insertions(+), 32 deletions(-)
diff --git a/_sources/contributor-guide/plugin_overview.md.txt
b/_sources/contributor-guide/plugin_overview.md.txt
index 9e6a104b..c7538290 100644
--- a/_sources/contributor-guide/plugin_overview.md.txt
+++ b/_sources/contributor-guide/plugin_overview.md.txt
@@ -17,30 +17,41 @@ specific language governing permissions and limitations
under the License.
-->
-# Comet Plugin Overview
+# Comet Plugin Architecture
-The entry point to Comet is the `org.apache.spark.CometPlugin` class, which
can be registered with Spark by adding the following setting to the Spark
configuration when launching `spark-shell` or `spark-submit`:
+## Comet SQL Plugin
+
+The entry point to Comet is the `org.apache.spark.CometPlugin` class, which
can be registered with Spark by adding the
+following setting to the Spark configuration when launching `spark-shell` or
`spark-submit`:
```
--conf spark.plugins=org.apache.spark.CometPlugin
```
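For context, a complete launch command might look like the following sketch. The `--jars` value is hypothetical; point it at the Comet jar produced by your own build:

```
# Hypothetical example: launching spark-shell with the Comet plugin enabled.
# COMET_JAR is assumed to point at the jar from your local Comet build.
spark-shell \
    --jars $COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin
```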
-On initialization, this class registers two physical plan optimization rules
with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a
query stage is being planned.
+On initialization, this class registers two physical plan optimization rules
with Spark: `CometScanRule`
+and `CometExecRule`. These rules run whenever a query stage is being planned
during Adaptive Query Execution, and
+run once for the entire plan when Adaptive Query Execution is disabled.
## CometScanRule
-`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+`CometScanRule` replaces any Parquet scans with Comet operators. There are
different paths for Spark v1 and v2 data sources.
-When the V1 data source API is being used, `FileSourceScanExec` is replaced
with `CometScanExec`.
+When reading from Parquet v1 data sources, Comet replaces `FileSourceScanExec`
with a `CometScanExec`, and for v2
+data sources, `BatchScanExec` is replaced with `CometBatchScanExec`. In both
cases, Comet replaces Spark's Parquet
+reader with a custom vectorized Parquet reader. This is similar to Spark's
vectorized Parquet reader used by the v2
+Parquet data source but leverages native code for decoding Parquet row groups
directly into Arrow format.
-When the V2 data source API is being used, `BatchScanExec` is replaced with
`CometBatchScanExec`.
+Comet only supports a subset of data types and will fall back to Spark's scan
if unsupported types
+exist. Comet can still accelerate the rest of the query execution in this case
because `CometSparkToColumnarExec` will
+convert the output from Spark's scan to Arrow arrays. Note that both
`spark.comet.exec.enabled=true` and
+`spark.comet.convert.parquet.enabled=true` must be set to enable this
conversion.
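As a sketch, the two settings named above can be passed alongside the plugin configuration when launching Spark:

```
--conf spark.comet.exec.enabled=true
--conf spark.comet.convert.parquet.enabled=true
```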
-## CometExecRule
+Refer to the [Supported Spark Data
Types](https://datafusion.apache.org/comet/user-guide/datatypes.html) section
+in the user guide to see a list of currently supported data types.
-`CometExecRule` attempts to transform a Spark physical plan into a Comet plan.
This rule is executed against
-individual query stages when they are being prepared for execution.
+## CometExecRule
-This rule traverses bottom-up from the original Spark plan and attempts to
replace each node with a Comet equivalent.
+This rule traverses bottom-up from the original Spark plan and attempts to
replace each operator with a Comet equivalent.
For example, a `ProjectExec` will be replaced by `CometProjectExec`.
When replacing a node, various checks are performed to determine if Comet can
support the operator and its expressions.
@@ -51,9 +62,27 @@ Comet does not support partially replacing subsets of the
plan within a query st
transitions to convert between row-based and columnar data between Spark
operators and Comet operators and the overhead
of this could outweigh the benefits of running parts of the query stage
natively in Comet.
-Once the plan has been transformed, it is serialized into Comet protocol
buffer format by the `QueryPlanSerde` class
-and this serialized plan is passed into the native code by `CometExecIterator`.
+## Query Execution
+
+Once the plan has been transformed, any consecutive Comet operators are
combined into a `CometNativeExec` which contains
+a serialized version of the plan (the serialization code can be found in
`QueryPlanSerde`). When this operator is
+executed, the serialized plan is passed to the native code when calling
`Native.createPlan`.
In the native code there is a `PhysicalPlanner` struct (in `planner.rs`) which
converts the serialized plan into an
-Apache DataFusion physical plan. In some cases, Comet provides specialized
physical operators and expressions to
+Apache DataFusion `ExecutionPlan`. In some cases, Comet provides specialized
physical operators and expressions to
override the DataFusion versions to ensure compatibility with Apache Spark.
+
+`CometExecIterator` will invoke `Native.executePlan` to pull the next batch
from the native plan. This is repeated
+until no more batches are available (meaning that all data has been processed
by the native plan).
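The pull-based loop described above can be sketched as follows. This is a simplified model, not Comet's actual API: the `NativePlan` struct and integer "batches" stand in for the real JNI calls and Arrow record batches, but the control flow mirrors how `CometExecIterator` repeatedly calls `Native.executePlan` until the plan is exhausted.

```rust
// Simplified model (not Comet's real API) of the pull-based execution loop:
// the JVM side asks the native plan for the next batch until none remain.
struct NativePlan {
    // Stand-in for the Arrow record batches the native plan would produce.
    remaining: Vec<Vec<i32>>,
}

impl NativePlan {
    /// Returns the next batch, or None when all data has been processed.
    fn execute_plan(&mut self) -> Option<Vec<i32>> {
        if self.remaining.is_empty() {
            None
        } else {
            Some(self.remaining.remove(0))
        }
    }
}

fn main() {
    let mut plan = NativePlan {
        remaining: vec![vec![1, 2, 3], vec![4, 5]],
    };
    // CometExecIterator's hasNext/next logic, in miniature: keep pulling
    // batches until the native plan reports that it is exhausted.
    let mut total_rows = 0;
    while let Some(batch) = plan.execute_plan() {
        total_rows += batch.len();
    }
    println!("{}", total_rows); // prints 5
}
```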
+
+The leaf nodes in the physical plan are always `ScanExec` and these operators
consume batches of Arrow data that were
+prepared before the plan is executed. When `CometExecIterator` invokes
`Native.executePlan` it passes the memory
+addresses of these Arrow arrays to the native code.
+
+
+
+## End to End Flow
+
+The following diagram shows the end-to-end flow.
+
+
diff --git a/_static/images/CometNativeExecution.drawio.png
b/_static/images/CometNativeExecution.drawio.png
new file mode 100644
index 00000000..ba122a1f
Binary files /dev/null and b/_static/images/CometNativeExecution.drawio.png
differ
diff --git a/_static/images/CometNativeParquetScan.drawio.png
b/_static/images/CometNativeParquetScan.drawio.png
new file mode 100644
index 00000000..712cbae4
Binary files /dev/null and b/_static/images/CometNativeParquetScan.drawio.png
differ
diff --git a/contributor-guide/contributing.html
b/contributor-guide/contributing.html
index 6e2c576f..b53f1435 100644
--- a/contributor-guide/contributing.html
+++ b/contributor-guide/contributing.html
@@ -53,7 +53,7 @@ under the License.
<script async="true" defer="true"
src="https://buttons.github.io/buttons.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
- <link rel="next" title="Comet Plugin Overview" href="plugin_overview.html"
/>
+ <link rel="next" title="Comet Plugin Architecture"
href="plugin_overview.html" />
<link rel="prev" title="Tuning Guide" href="../user-guide/tuning.html" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="docsearch:language" content="en">
@@ -384,7 +384,7 @@ coordinate on issues that they are working on.</p>
<a class='right-next' id="next-link" href="plugin_overview.html"
title="next page">
<div class="prev-next-info">
<p class="prev-next-subtitle">next</p>
- <p class="prev-next-title">Comet Plugin Overview</p>
+ <p class="prev-next-title">Comet Plugin Architecture</p>
</div>
<i class="fas fa-angle-right"></i>
</a>
diff --git a/contributor-guide/development.html
b/contributor-guide/development.html
index 164e54c6..914965e0 100644
--- a/contributor-guide/development.html
+++ b/contributor-guide/development.html
@@ -54,7 +54,7 @@ under the License.
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Comet Debugging Guide" href="debugging.html" />
- <link rel="prev" title="Comet Plugin Overview" href="plugin_overview.html"
/>
+ <link rel="prev" title="Comet Plugin Architecture"
href="plugin_overview.html" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="docsearch:language" content="en">
@@ -504,7 +504,7 @@ automatically format the code. Before submitting a pull
request, you can simply
<i class="fas fa-angle-left"></i>
<div class="prev-next-info">
<p class="prev-next-subtitle">previous</p>
- <p class="prev-next-title">Comet Plugin Overview</p>
+ <p class="prev-next-title">Comet Plugin Architecture</p>
</div>
</a>
<a class='right-next' id="next-link" href="debugging.html" title="next
page">
diff --git a/contributor-guide/plugin_overview.html
b/contributor-guide/plugin_overview.html
index fc9e7f97..3336c8d9 100644
--- a/contributor-guide/plugin_overview.html
+++ b/contributor-guide/plugin_overview.html
@@ -24,7 +24,7 @@ under the License.
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0"
/><meta name="viewport" content="width=device-width, initial-scale=1" />
- <title>Comet Plugin Overview — Apache DataFusion Comet
documentation</title>
+ <title>Comet Plugin Architecture — Apache DataFusion Comet
documentation</title>
<link href="../_static/styles/theme.css?digest=1999514e3f237ded88cf"
rel="stylesheet">
<link
href="../_static/styles/pydata-sphinx-theme.css?digest=1999514e3f237ded88cf"
rel="stylesheet">
@@ -271,6 +271,11 @@ under the License.
<nav id="bd-toc-nav">
<ul class="visible nav section-nav flex-column">
+ <li class="toc-h2 nav-item toc-entry">
+ <a class="reference internal nav-link" href="#comet-sql-plugin">
+ Comet SQL Plugin
+ </a>
+ </li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#cometscanrule">
CometScanRule
@@ -281,6 +286,16 @@ under the License.
CometExecRule
</a>
</li>
+ <li class="toc-h2 nav-item toc-entry">
+ <a class="reference internal nav-link" href="#query-execution">
+ Query Execution
+ </a>
+ </li>
+ <li class="toc-h2 nav-item toc-entry">
+ <a class="reference internal nav-link" href="#end-to-end-flow">
+ End to End Flow
+ </a>
+ </li>
</ul>
</nav>
@@ -327,24 +342,36 @@ KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
-<section id="comet-plugin-overview">
-<h1>Comet Plugin Overview<a class="headerlink" href="#comet-plugin-overview"
title="Link to this heading">¶</a></h1>
-<p>The entry point to Comet is the <code class="docutils literal
notranslate"><span class="pre">org.apache.spark.CometPlugin</span></code>
class, which can be registered with Spark by adding the following setting to
the Spark configuration when launching <code class="docutils literal
notranslate"><span class="pre">spark-shell</span></code> or <code
class="docutils literal notranslate"><span
class="pre">spark-submit</span></code>:</p>
+<section id="comet-plugin-architecture">
+<h1>Comet Plugin Architecture<a class="headerlink"
href="#comet-plugin-architecture" title="Link to this heading">¶</a></h1>
+<section id="comet-sql-plugin">
+<h2>Comet SQL Plugin<a class="headerlink" href="#comet-sql-plugin" title="Link
to this heading">¶</a></h2>
+<p>The entry point to Comet is the <code class="docutils literal
notranslate"><span class="pre">org.apache.spark.CometPlugin</span></code>
class, which can be registered with Spark by adding the
+following setting to the Spark configuration when launching <code
class="docutils literal notranslate"><span
class="pre">spark-shell</span></code> or <code class="docutils literal
notranslate"><span class="pre">spark-submit</span></code>:</p>
<div class="highlight-default notranslate"><div
class="highlight"><pre><span></span><span class="o">--</span><span
class="n">conf</span> <span class="n">spark</span><span class="o">.</span><span
class="n">plugins</span><span class="o">=</span><span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">spark</span><span class="o">.</span><span class="n">CometPlugin</span>
</pre></div>
</div>
-<p>On initialization, this class registers two physical plan optimization
rules with Spark: <code class="docutils literal notranslate"><span
class="pre">CometScanRule</span></code> and <code class="docutils literal
notranslate"><span class="pre">CometExecRule</span></code>. These rules run
whenever a query stage is being planned.</p>
+<p>On initialization, this class registers two physical plan optimization
rules with Spark: <code class="docutils literal notranslate"><span
class="pre">CometScanRule</span></code>
+and <code class="docutils literal notranslate"><span
class="pre">CometExecRule</span></code>. These rules run whenever a query stage
is being planned during Adaptive Query Execution, and
+run once for the entire plan when Adaptive Query Execution is disabled.</p>
+</section>
<section id="cometscanrule">
<h2>CometScanRule<a class="headerlink" href="#cometscanrule" title="Link to
this heading">¶</a></h2>
-<p><code class="docutils literal notranslate"><span
class="pre">CometScanRule</span></code> replaces any Parquet scans with Comet
Parquet scan classes.</p>
-<p>When the V1 data source API is being used, <code class="docutils literal
notranslate"><span class="pre">FileSourceScanExec</span></code> is replaced
with <code class="docutils literal notranslate"><span
class="pre">CometScanExec</span></code>.</p>
-<p>When the V2 data source API is being used, <code class="docutils literal
notranslate"><span class="pre">BatchScanExec</span></code> is replaced with
<code class="docutils literal notranslate"><span
class="pre">CometBatchScanExec</span></code>.</p>
+<p><code class="docutils literal notranslate"><span
class="pre">CometScanRule</span></code> replaces any Parquet scans with Comet
operators. There are different paths for Spark v1 and v2 data sources.</p>
+<p>When reading from Parquet v1 data sources, Comet replaces <code
class="docutils literal notranslate"><span
class="pre">FileSourceScanExec</span></code> with a <code class="docutils
literal notranslate"><span class="pre">CometScanExec</span></code>, and for v2
+data sources, <code class="docutils literal notranslate"><span
class="pre">BatchScanExec</span></code> is replaced with <code class="docutils
literal notranslate"><span class="pre">CometBatchScanExec</span></code>. In
both cases, Comet replaces Spark’s Parquet
+reader with a custom vectorized Parquet reader. This is similar to Spark’s
vectorized Parquet reader used by the v2
+Parquet data source but leverages native code for decoding Parquet row groups
directly into Arrow format.</p>
+<p>Comet only supports a subset of data types and will fall back to Spark’s
scan if unsupported types
+exist. Comet can still accelerate the rest of the query execution in this case
because <code class="docutils literal notranslate"><span
class="pre">CometSparkToColumnarExec</span></code> will
+convert the output from Spark’s scan to Arrow arrays. Note that both <code
class="docutils literal notranslate"><span
class="pre">spark.comet.exec.enabled=true</span></code> and
+<code class="docutils literal notranslate"><span
class="pre">spark.comet.convert.parquet.enabled=true</span></code> must be set
to enable this conversion.</p>
+<p>Refer to the <a class="reference external"
href="https://datafusion.apache.org/comet/user-guide/datatypes.html">Supported
Spark Data Types</a> section
+in the user guide to see a list of currently supported data types.</p>
</section>
<section id="cometexecrule">
<h2>CometExecRule<a class="headerlink" href="#cometexecrule" title="Link to
this heading">¶</a></h2>
-<p><code class="docutils literal notranslate"><span
class="pre">CometExecRule</span></code> attempts to transform a Spark physical
plan into a Comet plan. This rule is executed against
-individual query stages when they are being prepared for execution.</p>
-<p>This rule traverses bottom-up from the original Spark plan and attempts to
replace each node with a Comet equivalent.
+<p>This rule traverses bottom-up from the original Spark plan and attempts to
replace each operator with a Comet equivalent.
For example, a <code class="docutils literal notranslate"><span
class="pre">ProjectExec</span></code> will be replaced by <code class="docutils
literal notranslate"><span class="pre">CometProjectExec</span></code>.</p>
<p>When replacing a node, various checks are performed to determine if Comet
can support the operator and its expressions.
If an operator, expression, or data type is not supported by Comet then the
reason will be stored in a tag on the
@@ -352,11 +379,26 @@ underlying Spark node and the plan will not be
converted.</p>
<p>Comet does not support partially replacing subsets of the plan within a
query stage because this would involve adding
transitions to convert between row-based and columnar data between Spark
operators and Comet operators and the overhead
of this could outweigh the benefits of running parts of the query stage
natively in Comet.</p>
-<p>Once the plan has been transformed, it is serialized into Comet protocol
buffer format by the <code class="docutils literal notranslate"><span
class="pre">QueryPlanSerde</span></code> class
-and this serialized plan is passed into the native code by <code
class="docutils literal notranslate"><span
class="pre">CometExecIterator</span></code>.</p>
+</section>
+<section id="query-execution">
+<h2>Query Execution<a class="headerlink" href="#query-execution" title="Link
to this heading">¶</a></h2>
+<p>Once the plan has been transformed, any consecutive Comet operators are
combined into a <code class="docutils literal notranslate"><span
class="pre">CometNativeExec</span></code> which contains
+a serialized version of the plan (the serialization code can be found in <code
class="docutils literal notranslate"><span
class="pre">QueryPlanSerde</span></code>). When this operator is
+executed, the serialized plan is passed to the native code when calling <code
class="docutils literal notranslate"><span
class="pre">Native.createPlan</span></code>.</p>
<p>In the native code there is a <code class="docutils literal
notranslate"><span class="pre">PhysicalPlanner</span></code> struct (in <code
class="docutils literal notranslate"><span
class="pre">planner.rs</span></code>) which converts the serialized plan into an
-Apache DataFusion physical plan. In some cases, Comet provides specialized
physical operators and expressions to
+Apache DataFusion <code class="docutils literal notranslate"><span
class="pre">ExecutionPlan</span></code>. In some cases, Comet provides
specialized physical operators and expressions to
override the DataFusion versions to ensure compatibility with Apache Spark.</p>
+<p><code class="docutils literal notranslate"><span
class="pre">CometExecIterator</span></code> will invoke <code class="docutils
literal notranslate"><span class="pre">Native.executePlan</span></code> to pull
the next batch from the native plan. This is repeated
+until no more batches are available (meaning that all data has been processed
by the native plan).</p>
+<p>The leaf nodes in the physical plan are always <code class="docutils
literal notranslate"><span class="pre">ScanExec</span></code> and these
operators consume batches of Arrow data that were
+prepared before the plan is executed. When <code class="docutils literal
notranslate"><span class="pre">CometExecIterator</span></code> invokes <code
class="docutils literal notranslate"><span
class="pre">Native.executePlan</span></code> it passes the memory
+addresses of these Arrow arrays to the native code.</p>
+<p><img alt="Diagram of Comet Native Execution"
src="../_static/images/CometNativeExecution.drawio.png" /></p>
+</section>
+<section id="end-to-end-flow">
+<h2>End to End Flow<a class="headerlink" href="#end-to-end-flow" title="Link
to this heading">¶</a></h2>
+<p>The following diagram shows the end-to-end flow.</p>
+<p><img alt="Diagram of Comet Native Parquet Scan"
src="../_static/images/CometNativeParquetScan.drawio.png" /></p>
</section>
</section>
diff --git a/objects.inv b/objects.inv
index 0523a739..013792d3 100644
Binary files a/objects.inv and b/objects.inv differ
diff --git a/searchindex.js b/searchindex.js
index 3cbbe3f7..b023388c 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"alltitles": {"1. Install Comet": [[9, "install-comet"]], "2.
Clone Spark and Apply Diff": [[9, "clone-spark-and-apply-diff"]], "3. Run Spark
SQL Tests": [[9, "run-spark-sql-tests"]], "ANSI mode": [[11, "ansi-mode"]],
"API Differences Between Spark Versions": [[0,
"api-differences-between-spark-versions"]], "ASF Links": [[10, null]], "Adding
Spark-side Tests for the New Expression": [[0,
"adding-spark-side-tests-for-the-new-expression"]], "Adding a New Expression":
[[0, [...]
\ No newline at end of file
+Search.setIndex({"alltitles": {"1. Install Comet": [[9, "install-comet"]], "2.
Clone Spark and Apply Diff": [[9, "clone-spark-and-apply-diff"]], "3. Run Spark
SQL Tests": [[9, "run-spark-sql-tests"]], "ANSI mode": [[11, "ansi-mode"]],
"API Differences Between Spark Versions": [[0,
"api-differences-between-spark-versions"]], "ASF Links": [[10, null]], "Adding
Spark-side Tests for the New Expression": [[0,
"adding-spark-side-tests-for-the-new-expression"]], "Adding a New Expression":
[[0, [...]
\ No newline at end of file
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]