This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 1b212f5d Publish built docs triggered by 
9dfd6d110b329d47a505a0fee73a9f3e898c4929
1b212f5d is described below

commit 1b212f5d3e4d8b436dbce84e96600fb904dd4781
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AuthorDate: Thu Sep 19 19:23:11 2024 +0000

    Publish built docs triggered by 9dfd6d110b329d47a505a0fee73a9f3e898c4929
---
 _sources/contributor-guide/plugin_overview.md.txt |  55 +++++++++++++----
 _static/images/CometNativeExecution.drawio.png    | Bin 0 -> 61017 bytes
 _static/images/CometNativeParquetScan.drawio.png  | Bin 0 -> 75703 bytes
 contributor-guide/contributing.html               |   4 +-
 contributor-guide/development.html                |   4 +-
 contributor-guide/plugin_overview.html            |  70 +++++++++++++++++-----
 objects.inv                                       | Bin 745 -> 751 bytes
 searchindex.js                                    |   2 +-
 8 files changed, 103 insertions(+), 32 deletions(-)

diff --git a/_sources/contributor-guide/plugin_overview.md.txt 
b/_sources/contributor-guide/plugin_overview.md.txt
index 9e6a104b..c7538290 100644
--- a/_sources/contributor-guide/plugin_overview.md.txt
+++ b/_sources/contributor-guide/plugin_overview.md.txt
@@ -17,30 +17,41 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Comet Plugin Overview
+# Comet Plugin Architecture
 
-The entry point to Comet is the `org.apache.spark.CometPlugin` class, which 
can be registered with Spark by adding the following setting to the Spark 
configuration when launching `spark-shell` or `spark-submit`:
+## Comet SQL Plugin
+
+The entry point to Comet is the `org.apache.spark.CometPlugin` class, which 
can be registered with Spark by adding the
+following setting to the Spark configuration when launching `spark-shell` or 
`spark-submit`:
 
 ```
 --conf spark.plugins=org.apache.spark.CometPlugin
 ```
 
-On initialization, this class registers two physical plan optimization rules 
with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a 
query stage is being planned.
+On initialization, this class registers two physical plan optimization rules 
with Spark: `CometScanRule`
+and `CometExecRule`. These rules run whenever a query stage is being planned 
during Adaptive Query Execution, and
+run once for the entire plan when Adaptive Query Execution is disabled.
 
 ## CometScanRule
 
-`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+`CometScanRule` replaces any Parquet scans with Comet operators. There are 
different paths for Spark v1 and v2 data sources.
 
-When the V1 data source API is being used, `FileSourceScanExec` is replaced 
with `CometScanExec`.
+When reading from Parquet v1 data sources, Comet replaces `FileSourceScanExec` 
with a `CometScanExec`, and for v2
+data sources, `BatchScanExec` is replaced with `CometBatchScanExec`. In both 
cases, Comet replaces Spark's Parquet
+reader with a custom vectorized Parquet reader. This is similar to Spark's 
vectorized Parquet reader used by the v2
+Parquet data source but leverages native code for decoding Parquet row groups 
directly into Arrow format.
 
-When the V2 data source API is being used, `BatchScanExec` is replaced with 
`CometBatchScanExec`.
+Comet only supports a subset of data types and will fall back to Spark's scan 
if unsupported types
+exist. Comet can still accelerate the rest of the query execution in this case 
because `CometSparkToColumnarExec` will
+convert the output from Spark's scan to Arrow arrays. Note that both 
`spark.comet.exec.enabled=true` and
+`spark.comet.convert.parquet.enabled=true` must be set to enable this 
conversion.
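+
+For example, both settings can be passed alongside the plugin registration shown
+earlier (an illustrative `spark-submit` invocation; the configuration keys are the
+ones named above, but any of Spark's usual configuration mechanisms works):
+
+```
+--conf spark.plugins=org.apache.spark.CometPlugin \
+--conf spark.comet.exec.enabled=true \
+--conf spark.comet.convert.parquet.enabled=true
+```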
 
-## CometExecRule
+Refer to the [Supported Spark Data 
Types](https://datafusion.apache.org/comet/user-guide/datatypes.html) section
+in the user guide to see a list of currently supported data types.
 
-`CometExecRule` attempts to transform a Spark physical plan into a Comet plan. 
This rule is executed against
-individual query stages when they are being prepared for execution.
+## CometExecRule
 
-This rule traverses bottom-up from the original Spark plan and attempts to 
replace each node with a Comet equivalent.
+This rule traverses bottom-up from the original Spark plan and attempts to 
replace each operator with a Comet equivalent.
 For example, a `ProjectExec` will be replaced by `CometProjectExec`.
 
 When replacing a node, various checks are performed to determine if Comet can 
support the operator and its expressions.
@@ -51,9 +62,27 @@ Comet does not support partially replacing subsets of the 
plan within a query st
 transitions to convert between row-based and columnar data between Spark 
operators and Comet operators and the overhead
 of this could outweigh the benefits of running parts of the query stage 
natively in Comet.
 
-Once the plan has been transformed, it is serialized into Comet protocol 
buffer format by the `QueryPlanSerde` class
-and this serialized plan is passed into the native code by `CometExecIterator`.
+## Query Execution
+
+Once the plan has been transformed, any consecutive Comet operators are 
combined into a `CometNativeExec` which contains
+a serialized version of the plan (the serialization code can be found in 
`QueryPlanSerde`). When this operator is
+executed, the serialized plan is passed to the native code when calling 
`Native.createPlan`.
 
 In the native code there is a `PhysicalPlanner` struct (in `planner.rs`) which 
converts the serialized plan into an
-Apache DataFusion physical plan. In some cases, Comet provides specialized 
physical operators and expressions to
+Apache DataFusion `ExecutionPlan`. In some cases, Comet provides specialized 
physical operators and expressions to
 override the DataFusion versions to ensure compatibility with Apache Spark.
+
+`CometExecIterator` will invoke `Native.executePlan` to pull the next batch 
from the native plan. This is repeated
+until no more batches are available (meaning that all data has been processed 
by the native plan).
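+
+The batch-pulling loop can be sketched as follows. This is an illustrative Scala
+sketch only: `Native.createPlan` and `Native.executePlan` are named above, but the
+signatures and the iterator internals shown here are assumptions, not the actual
+implementation.
+
+```
+class CometExecIterator(native: Native, serializedPlan: Array[Byte])
+    extends Iterator[ColumnarBatch] {
+  // Build the native plan once from the serialized protobuf bytes
+  private val planId: Long = native.createPlan(serializedPlan)
+  private var batch: Option[ColumnarBatch] = None
+
+  override def hasNext: Boolean = {
+    if (batch.isEmpty) {
+      // Assumed to return None once all input has been processed
+      batch = native.executePlan(planId)
+    }
+    batch.isDefined
+  }
+
+  override def next(): ColumnarBatch = {
+    val b = batch.getOrElse(throw new NoSuchElementException())
+    batch = None
+    b
+  }
+}
+```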
+
+The leaf nodes in the physical plan are always `ScanExec` and these operators 
consume batches of Arrow data that are
+prepared before the plan is executed. When `CometExecIterator` invokes 
`Native.executePlan` it passes the memory
+addresses of these Arrow arrays to the native code.
+
+![Diagram of Comet Native 
Execution](../../_static/images/CometNativeExecution.drawio.png)
+
+## End to End Flow
+
+The following diagram shows the end-to-end flow.
+
+![Diagram of Comet Native Parquet 
Scan](../../_static/images/CometNativeParquetScan.drawio.png)
diff --git a/_static/images/CometNativeExecution.drawio.png 
b/_static/images/CometNativeExecution.drawio.png
new file mode 100644
index 00000000..ba122a1f
Binary files /dev/null and b/_static/images/CometNativeExecution.drawio.png 
differ
diff --git a/_static/images/CometNativeParquetScan.drawio.png 
b/_static/images/CometNativeParquetScan.drawio.png
new file mode 100644
index 00000000..712cbae4
Binary files /dev/null and b/_static/images/CometNativeParquetScan.drawio.png 
differ
diff --git a/contributor-guide/contributing.html 
b/contributor-guide/contributing.html
index 6e2c576f..b53f1435 100644
--- a/contributor-guide/contributing.html
+++ b/contributor-guide/contributing.html
@@ -53,7 +53,7 @@ under the License.
     <script async="true" defer="true" 
src="https://buttons.github.io/buttons.js"></script>
     <link rel="index" title="Index" href="../genindex.html" />
     <link rel="search" title="Search" href="../search.html" />
-    <link rel="next" title="Comet Plugin Overview" href="plugin_overview.html" 
/>
+    <link rel="next" title="Comet Plugin Architecture" 
href="plugin_overview.html" />
     <link rel="prev" title="Tuning Guide" href="../user-guide/tuning.html" />
     <meta name="viewport" content="width=device-width, initial-scale=1" />
     <meta name="docsearch:language" content="en">
@@ -384,7 +384,7 @@ coordinate on issues that they are working on.</p>
     <a class='right-next' id="next-link" href="plugin_overview.html" 
title="next page">
     <div class="prev-next-info">
         <p class="prev-next-subtitle">next</p>
-        <p class="prev-next-title">Comet Plugin Overview</p>
+        <p class="prev-next-title">Comet Plugin Architecture</p>
     </div>
     <i class="fas fa-angle-right"></i>
     </a>
diff --git a/contributor-guide/development.html 
b/contributor-guide/development.html
index 164e54c6..914965e0 100644
--- a/contributor-guide/development.html
+++ b/contributor-guide/development.html
@@ -54,7 +54,7 @@ under the License.
     <link rel="index" title="Index" href="../genindex.html" />
     <link rel="search" title="Search" href="../search.html" />
     <link rel="next" title="Comet Debugging Guide" href="debugging.html" />
-    <link rel="prev" title="Comet Plugin Overview" href="plugin_overview.html" 
/>
+    <link rel="prev" title="Comet Plugin Architecture" 
href="plugin_overview.html" />
     <meta name="viewport" content="width=device-width, initial-scale=1" />
     <meta name="docsearch:language" content="en">
     
@@ -504,7 +504,7 @@ automatically format the code. Before submitting a pull 
request, you can simply
         <i class="fas fa-angle-left"></i>
         <div class="prev-next-info">
             <p class="prev-next-subtitle">previous</p>
-            <p class="prev-next-title">Comet Plugin Overview</p>
+            <p class="prev-next-title">Comet Plugin Architecture</p>
         </div>
     </a>
     <a class='right-next' id="next-link" href="debugging.html" title="next 
page">
diff --git a/contributor-guide/plugin_overview.html 
b/contributor-guide/plugin_overview.html
index fc9e7f97..3336c8d9 100644
--- a/contributor-guide/plugin_overview.html
+++ b/contributor-guide/plugin_overview.html
@@ -24,7 +24,7 @@ under the License.
     <meta charset="utf-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" 
/><meta name="viewport" content="width=device-width, initial-scale=1" />
 
-    <title>Comet Plugin Overview &#8212; Apache DataFusion Comet  
documentation</title>
+    <title>Comet Plugin Architecture &#8212; Apache DataFusion Comet  
documentation</title>
     
     <link href="../_static/styles/theme.css?digest=1999514e3f237ded88cf" 
rel="stylesheet">
 <link 
href="../_static/styles/pydata-sphinx-theme.css?digest=1999514e3f237ded88cf" 
rel="stylesheet">
@@ -271,6 +271,11 @@ under the License.
 
 <nav id="bd-toc-nav">
     <ul class="visible nav section-nav flex-column">
+ <li class="toc-h2 nav-item toc-entry">
+  <a class="reference internal nav-link" href="#comet-sql-plugin">
+   Comet SQL Plugin
+  </a>
+ </li>
  <li class="toc-h2 nav-item toc-entry">
   <a class="reference internal nav-link" href="#cometscanrule">
    CometScanRule
@@ -281,6 +286,16 @@ under the License.
    CometExecRule
   </a>
  </li>
+ <li class="toc-h2 nav-item toc-entry">
+  <a class="reference internal nav-link" href="#query-execution">
+   Query Execution
+  </a>
+ </li>
+ <li class="toc-h2 nav-item toc-entry">
+  <a class="reference internal nav-link" href="#end-to-end-flow">
+   End to End Flow
+  </a>
+ </li>
 </ul>
 
 </nav>
@@ -327,24 +342,36 @@ KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
-<section id="comet-plugin-overview">
-<h1>Comet Plugin Overview<a class="headerlink" href="#comet-plugin-overview" 
title="Link to this heading">¶</a></h1>
-<p>The entry point to Comet is the <code class="docutils literal 
notranslate"><span class="pre">org.apache.spark.CometPlugin</span></code> 
class, which can be registered with Spark by adding the following setting to 
the Spark configuration when launching <code class="docutils literal 
notranslate"><span class="pre">spark-shell</span></code> or <code 
class="docutils literal notranslate"><span 
class="pre">spark-submit</span></code>:</p>
+<section id="comet-plugin-architecture">
+<h1>Comet Plugin Architecture<a class="headerlink" 
href="#comet-plugin-architecture" title="Link to this heading">¶</a></h1>
+<section id="comet-sql-plugin">
+<h2>Comet SQL Plugin<a class="headerlink" href="#comet-sql-plugin" title="Link 
to this heading">¶</a></h2>
+<p>The entry point to Comet is the <code class="docutils literal 
notranslate"><span class="pre">org.apache.spark.CometPlugin</span></code> 
class, which can be registered with Spark by adding the
+following setting to the Spark configuration when launching <code 
class="docutils literal notranslate"><span 
class="pre">spark-shell</span></code> or <code class="docutils literal 
notranslate"><span class="pre">spark-submit</span></code>:</p>
 <div class="highlight-default notranslate"><div 
class="highlight"><pre><span></span><span class="o">--</span><span 
class="n">conf</span> <span class="n">spark</span><span class="o">.</span><span 
class="n">plugins</span><span class="o">=</span><span class="n">org</span><span 
class="o">.</span><span class="n">apache</span><span class="o">.</span><span 
class="n">spark</span><span class="o">.</span><span class="n">CometPlugin</span>
 </pre></div>
 </div>
-<p>On initialization, this class registers two physical plan optimization 
rules with Spark: <code class="docutils literal notranslate"><span 
class="pre">CometScanRule</span></code> and <code class="docutils literal 
notranslate"><span class="pre">CometExecRule</span></code>. These rules run 
whenever a query stage is being planned.</p>
+<p>On initialization, this class registers two physical plan optimization 
rules with Spark: <code class="docutils literal notranslate"><span 
class="pre">CometScanRule</span></code>
+and <code class="docutils literal notranslate"><span 
class="pre">CometExecRule</span></code>. These rules run whenever a query stage 
is being planned during Adaptive Query Execution, and
+run once for the entire plan when Adaptive Query Execution is disabled.</p>
+</section>
 <section id="cometscanrule">
 <h2>CometScanRule<a class="headerlink" href="#cometscanrule" title="Link to 
this heading">¶</a></h2>
-<p><code class="docutils literal notranslate"><span 
class="pre">CometScanRule</span></code> replaces any Parquet scans with Comet 
Parquet scan classes.</p>
-<p>When the V1 data source API is being used, <code class="docutils literal 
notranslate"><span class="pre">FileSourceScanExec</span></code> is replaced 
with <code class="docutils literal notranslate"><span 
class="pre">CometScanExec</span></code>.</p>
-<p>When the V2 data source API is being used, <code class="docutils literal 
notranslate"><span class="pre">BatchScanExec</span></code> is replaced with 
<code class="docutils literal notranslate"><span 
class="pre">CometBatchScanExec</span></code>.</p>
+<p><code class="docutils literal notranslate"><span 
class="pre">CometScanRule</span></code> replaces any Parquet scans with Comet 
operators. There are different paths for Spark v1 and v2 data sources.</p>
+<p>When reading from Parquet v1 data sources, Comet replaces <code 
class="docutils literal notranslate"><span 
class="pre">FileSourceScanExec</span></code> with a <code class="docutils 
literal notranslate"><span class="pre">CometScanExec</span></code>, and for v2
+data sources, <code class="docutils literal notranslate"><span 
class="pre">BatchScanExec</span></code> is replaced with <code class="docutils 
literal notranslate"><span class="pre">CometBatchScanExec</span></code>. In 
both cases, Comet replaces Spark’s Parquet
+reader with a custom vectorized Parquet reader. This is similar to Spark’s 
vectorized Parquet reader used by the v2
+Parquet data source but leverages native code for decoding Parquet row groups 
directly into Arrow format.</p>
+<p>Comet only supports a subset of data types and will fall back to Spark’s 
scan if unsupported types
+exist. Comet can still accelerate the rest of the query execution in this case 
because <code class="docutils literal notranslate"><span 
class="pre">CometSparkToColumnarExec</span></code> will
+convert the output from Spark’s scan to Arrow arrays. Note that both <code 
class="docutils literal notranslate"><span 
class="pre">spark.comet.exec.enabled=true</span></code> and
+<code class="docutils literal notranslate"><span 
class="pre">spark.comet.convert.parquet.enabled=true</span></code> must be set 
to enable this conversion.</p>
+<p>Refer to the <a class="reference external" 
href="https://datafusion.apache.org/comet/user-guide/datatypes.html">Supported 
Spark Data Types</a> section
+in the user guide to see a list of currently supported data types.</p>
 </section>
 <section id="cometexecrule">
 <h2>CometExecRule<a class="headerlink" href="#cometexecrule" title="Link to 
this heading">¶</a></h2>
-<p><code class="docutils literal notranslate"><span 
class="pre">CometExecRule</span></code> attempts to transform a Spark physical 
plan into a Comet plan. This rule is executed against
-individual query stages when they are being prepared for execution.</p>
-<p>This rule traverses bottom-up from the original Spark plan and attempts to 
replace each node with a Comet equivalent.
+<p>This rule traverses bottom-up from the original Spark plan and attempts to 
replace each operator with a Comet equivalent.
 For example, a <code class="docutils literal notranslate"><span 
class="pre">ProjectExec</span></code> will be replaced by <code class="docutils 
literal notranslate"><span class="pre">CometProjectExec</span></code>.</p>
 <p>When replacing a node, various checks are performed to determine if Comet 
can support the operator and its expressions.
 If an operator, expression, or data type is not supported by Comet then the 
reason will be stored in a tag on the
@@ -352,11 +379,26 @@ underlying Spark node and the plan will not be 
converted.</p>
 <p>Comet does not support partially replacing subsets of the plan within a 
query stage because this would involve adding
 transitions to convert between row-based and columnar data between Spark 
operators and Comet operators and the overhead
 of this could outweigh the benefits of running parts of the query stage 
natively in Comet.</p>
-<p>Once the plan has been transformed, it is serialized into Comet protocol 
buffer format by the <code class="docutils literal notranslate"><span 
class="pre">QueryPlanSerde</span></code> class
-and this serialized plan is passed into the native code by <code 
class="docutils literal notranslate"><span 
class="pre">CometExecIterator</span></code>.</p>
+</section>
+<section id="query-execution">
+<h2>Query Execution<a class="headerlink" href="#query-execution" title="Link 
to this heading">¶</a></h2>
+<p>Once the plan has been transformed, any consecutive Comet operators are 
combined into a <code class="docutils literal notranslate"><span 
class="pre">CometNativeExec</span></code> which contains
+a serialized version of the plan (the serialization code can be found in <code 
class="docutils literal notranslate"><span 
class="pre">QueryPlanSerde</span></code>). When this operator is
+executed, the serialized plan is passed to the native code when calling <code 
class="docutils literal notranslate"><span 
class="pre">Native.createPlan</span></code>.</p>
 <p>In the native code there is a <code class="docutils literal 
notranslate"><span class="pre">PhysicalPlanner</span></code> struct (in <code 
class="docutils literal notranslate"><span 
class="pre">planner.rs</span></code>) which converts the serialized plan into an
-Apache DataFusion physical plan. In some cases, Comet provides specialized 
physical operators and expressions to
+Apache DataFusion <code class="docutils literal notranslate"><span 
class="pre">ExecutionPlan</span></code>. In some cases, Comet provides 
specialized physical operators and expressions to
 override the DataFusion versions to ensure compatibility with Apache Spark.</p>
+<p><code class="docutils literal notranslate"><span 
class="pre">CometExecIterator</span></code> will invoke <code class="docutils 
literal notranslate"><span class="pre">Native.executePlan</span></code> to pull 
the next batch from the native plan. This is repeated
+until no more batches are available (meaning that all data has been processed 
by the native plan).</p>
+<p>The leaf nodes in the physical plan are always <code class="docutils 
literal notranslate"><span class="pre">ScanExec</span></code> and these 
operators consume batches of Arrow data that are
+prepared before the plan is executed. When <code class="docutils literal 
notranslate"><span class="pre">CometExecIterator</span></code> invokes <code 
class="docutils literal notranslate"><span 
class="pre">Native.executePlan</span></code> it passes the memory
+addresses of these Arrow arrays to the native code.</p>
+<p><img alt="Diagram of Comet Native Execution" 
src="../_static/images/CometNativeExecution.drawio.png" /></p>
+</section>
+<section id="end-to-end-flow">
+<h2>End to End Flow<a class="headerlink" href="#end-to-end-flow" title="Link 
to this heading">¶</a></h2>
+<p>The following diagram shows the end-to-end flow.</p>
+<p><img alt="Diagram of Comet Native Parquet Scan" 
src="../_static/images/CometNativeParquetScan.drawio.png" /></p>
 </section>
 </section>
 
diff --git a/objects.inv b/objects.inv
index 0523a739..013792d3 100644
Binary files a/objects.inv and b/objects.inv differ
diff --git a/searchindex.js b/searchindex.js
index 3cbbe3f7..b023388c 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"alltitles": {"1. Install Comet": [[9, "install-comet"]], "2. 
Clone Spark and Apply Diff": [[9, "clone-spark-and-apply-diff"]], "3. Run Spark 
SQL Tests": [[9, "run-spark-sql-tests"]], "ANSI mode": [[11, "ansi-mode"]], 
"API Differences Between Spark Versions": [[0, 
"api-differences-between-spark-versions"]], "ASF Links": [[10, null]], "Adding 
Spark-side Tests for the New Expression": [[0, 
"adding-spark-side-tests-for-the-new-expression"]], "Adding a New Expression": 
[[0,  [...]
\ No newline at end of file
+Search.setIndex({"alltitles": {"1. Install Comet": [[9, "install-comet"]], "2. 
Clone Spark and Apply Diff": [[9, "clone-spark-and-apply-diff"]], "3. Run Spark 
SQL Tests": [[9, "run-spark-sql-tests"]], "ANSI mode": [[11, "ansi-mode"]], 
"API Differences Between Spark Versions": [[0, 
"api-differences-between-spark-versions"]], "ASF Links": [[10, null]], "Adding 
Spark-side Tests for the New Expression": [[0, 
"adding-spark-side-tests-for-the-new-expression"]], "Adding a New Expression": 
[[0,  [...]
\ No newline at end of file


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
