This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new 9dfd6d11 docs: Add more detailed architecture documentation (#922)
9dfd6d11 is described below
commit 9dfd6d110b329d47a505a0fee73a9f3e898c4929
Author: Andy Grove <[email protected]>
AuthorDate: Thu Sep 19 13:22:41 2024 -0600
docs: Add more detailed architecture documentation (#922)
* initial notes on parquet support
* add draft diagram
* update
* improve diagram
* improve diagram
* improve diagram
* improve diagram
* improve diagram and docs based on feedback
* rewrite section on Parquet support
* rewrite section on Parquet support
* update docs and add diagram showing execution flow
* fix link
* Improve diagrams
* use fewer colors
* remove temp files
* remove some content, fix description of parquet reader
* address feedback
* address feedback
---
.../_static/images/CometNativeExecution.drawio.png | Bin 0 -> 61017 bytes
.../images/CometNativeParquetScan.drawio.png | Bin 0 -> 75703 bytes
docs/source/contributor-guide/plugin_overview.md | 55 ++++++++++++++++-----
3 files changed, 42 insertions(+), 13 deletions(-)
diff --git a/docs/source/_static/images/CometNativeExecution.drawio.png b/docs/source/_static/images/CometNativeExecution.drawio.png
new file mode 100644
index 00000000..ba122a1f
Binary files /dev/null and b/docs/source/_static/images/CometNativeExecution.drawio.png differ
diff --git a/docs/source/_static/images/CometNativeParquetScan.drawio.png b/docs/source/_static/images/CometNativeParquetScan.drawio.png
new file mode 100644
index 00000000..712cbae4
Binary files /dev/null and b/docs/source/_static/images/CometNativeParquetScan.drawio.png differ
diff --git a/docs/source/contributor-guide/plugin_overview.md b/docs/source/contributor-guide/plugin_overview.md
index 9e6a104b..c7538290 100644
--- a/docs/source/contributor-guide/plugin_overview.md
+++ b/docs/source/contributor-guide/plugin_overview.md
@@ -17,30 +17,41 @@ specific language governing permissions and limitations
under the License.
-->
-# Comet Plugin Overview
+# Comet Plugin Architecture
-The entry point to Comet is the `org.apache.spark.CometPlugin` class, which can be registered with Spark by adding the following setting to the Spark configuration when launching `spark-shell` or `spark-submit`:
+## Comet SQL Plugin
+
+The entry point to Comet is the `org.apache.spark.CometPlugin` class, which can be registered with Spark by adding the
+following setting to the Spark configuration when launching `spark-shell` or `spark-submit`:
```
--conf spark.plugins=org.apache.spark.CometPlugin
```
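
For example, a launch command might look like the following (a sketch only: the Comet jar path shown here is illustrative, not the actual artifact name):

```
spark-shell \
    --jars /path/to/comet-spark.jar \
    --conf spark.plugins=org.apache.spark.CometPlugin
```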
-On initialization, this class registers two physical plan optimization rules with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a query stage is being planned.
+On initialization, this class registers two physical plan optimization rules with Spark: `CometScanRule`
+and `CometExecRule`. These rules run whenever a query stage is being planned during Adaptive Query Execution, and
+run once for the entire plan when Adaptive Query Execution is disabled.
## CometScanRule
-`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+`CometScanRule` replaces any Parquet scans with Comet operators. There are different paths for Spark v1 and v2 data sources.
-When the V1 data source API is being used, `FileSourceScanExec` is replaced with `CometScanExec`.
+When reading from Parquet v1 data sources, Comet replaces `FileSourceScanExec` with a `CometScanExec`, and for v2
+data sources, `BatchScanExec` is replaced with `CometBatchScanExec`. In both cases, Comet replaces Spark's Parquet
+reader with a custom vectorized Parquet reader. This is similar to Spark's vectorized Parquet reader used by the v2
+Parquet data source but leverages native code for decoding Parquet row groups directly into Arrow format.
-When the V2 data source API is being used, `BatchScanExec` is replaced with `CometBatchScanExec`.
+Comet only supports a subset of data types and will fall back to Spark's scan if unsupported types
+exist. Comet can still accelerate the rest of the query execution in this case because `CometSparkToColumnarExec` will
+convert the output from Spark's scan to Arrow arrays. Note that both `spark.comet.exec.enabled=true` and
+`spark.comet.convert.parquet.enabled=true` must be set to enable this conversion.
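
Both settings can be passed at launch time in the same way as the plugin setting itself:

```
--conf spark.comet.exec.enabled=true
--conf spark.comet.convert.parquet.enabled=true
```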
-## CometExecRule
+Refer to the [Supported Spark Data Types](https://datafusion.apache.org/comet/user-guide/datatypes.html) section
+in the user guide to see a list of currently supported data types.
-`CometExecRule` attempts to transform a Spark physical plan into a Comet plan. This rule is executed against
-individual query stages when they are being prepared for execution.
+## CometExecRule
-This rule traverses bottom-up from the original Spark plan and attempts to replace each node with a Comet equivalent.
+This rule traverses bottom-up from the original Spark plan and attempts to replace each operator with a Comet equivalent.
For example, a `ProjectExec` will be replaced by `CometProjectExec`.
When replacing a node, various checks are performed to determine if Comet can support the operator and its expressions.
@@ -51,9 +62,27 @@ Comet does not support partially replacing subsets of the plan within a query st
transitions to convert between row-based and columnar data between Spark operators and Comet operators and the overhead
of this could outweigh the benefits of running parts of the query stage natively in Comet.
-Once the plan has been transformed, it is serialized into Comet protocol buffer format by the `QueryPlanSerde` class
-and this serialized plan is passed into the native code by `CometExecIterator`.
+## Query Execution
+
+Once the plan has been transformed, any consecutive Comet operators are combined into a `CometNativeExec` which contains
+a serialized version of the plan (the serialization code can be found in `QueryPlanSerde`). When this operator is
+executed, the serialized plan is passed to the native code when calling `Native.createPlan`.
In the native code there is a `PhysicalPlanner` struct (in `planner.rs`) which converts the serialized plan into an
-Apache DataFusion physical plan. In some cases, Comet provides specialized physical operators and expressions to
+Apache DataFusion `ExecutionPlan`. In some cases, Comet provides specialized physical operators and expressions to
override the DataFusion versions to ensure compatibility with Apache Spark.
+
+`CometExecIterator` will invoke `Native.executePlan` to pull the next batch from the native plan. This is repeated
+until no more batches are available (meaning that all data has been processed by the native plan).
+
+The leaf nodes in the physical plan are always `ScanExec` and these operators consume batches of Arrow data that were
+prepared before the plan is executed. When `CometExecIterator` invokes `Native.executePlan` it passes the memory
+addresses of these Arrow arrays to the native code.
+
+
+
+## End to End Flow
+
+The following diagram shows the end-to-end flow.
+
+
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]