This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new 9dfd6d11 docs: Add more detailed architecture documentation (#922)
9dfd6d11 is described below
commit 9dfd6d110b329d47a505a0fee73a9f3e898c4929
Author: Andy Grove <[email protected]>
AuthorDate: Thu Sep 19 13:22:41 2024 -0600
docs: Add more detailed architecture documentation (#922)
* initial notes on parquet support
* add draft diagram
* update
* improve diagram
* improve diagram
* improve diagram
* improve diagram
* improve diagram and docs based on feedback
* rewrite section on Parquet support
* rewrite section on Parquet support
* update docs and add diagram showing execution flow
* fix link
* Improve diagrams
* use fewer colors
* remove temp files
* remove some content, fix description of parquet reader
* address feedback
* address feedback
---
.../_static/images/CometNativeExecution.drawio.png | Bin 0 -> 61017 bytes
.../images/CometNativeParquetScan.drawio.png | Bin 0 -> 75703 bytes
docs/source/contributor-guide/plugin_overview.md | 55 ++++++++++++++++-----
3 files changed, 42 insertions(+), 13 deletions(-)
diff --git a/docs/source/_static/images/CometNativeExecution.drawio.png b/docs/source/_static/images/CometNativeExecution.drawio.png
new file mode 100644
index 00000000..ba122a1f
Binary files /dev/null and b/docs/source/_static/images/CometNativeExecution.drawio.png differ
diff --git a/docs/source/_static/images/CometNativeParquetScan.drawio.png b/docs/source/_static/images/CometNativeParquetScan.drawio.png
new file mode 100644
index 00000000..712cbae4
Binary files /dev/null and b/docs/source/_static/images/CometNativeParquetScan.drawio.png differ
diff --git a/docs/source/contributor-guide/plugin_overview.md b/docs/source/contributor-guide/plugin_overview.md
index 9e6a104b..c7538290 100644
--- a/docs/source/contributor-guide/plugin_overview.md
+++ b/docs/source/contributor-guide/plugin_overview.md
@@ -17,30 +17,41 @@ specific language governing permissions and limitations
under the License.
-->
-# Comet Plugin Overview
+# Comet Plugin Architecture
-The entry point to Comet is the `org.apache.spark.CometPlugin` class, which can be registered with Spark by adding the following setting to the Spark configuration when launching `spark-shell` or `spark-submit`:
+## Comet SQL Plugin
+
+The entry point to Comet is the `org.apache.spark.CometPlugin` class, which can be registered with Spark by adding the
+following setting to the Spark configuration when launching `spark-shell` or `spark-submit`:
```
--conf spark.plugins=org.apache.spark.CometPlugin
```
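
For example, a launch command might look like the following (a sketch only: the Comet jar path shown here is illustrative, not the actual artifact name):

```
spark-shell \
    --jars /path/to/comet-spark.jar \
    --conf spark.plugins=org.apache.spark.CometPlugin
```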
-On initialization, this class registers two physical plan optimization rules with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a query stage is being planned.
+On initialization, this class registers two physical plan optimization rules with Spark: `CometScanRule`
+and `CometExecRule`. These rules run whenever a query stage is being planned during Adaptive Query Execution, and
+run once for the entire plan when Adaptive Query Execution is disabled.
## CometScanRule
-`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes.
+`CometScanRule` replaces any Parquet scans with Comet operators. There are different paths for Spark v1 and v2 data sources.
-When the V1 data source API is being used, `FileSourceScanExec` is replaced with `CometScanExec`.
+When reading from Parquet v1 data sources, Comet replaces `FileSourceScanExec` with a `CometScanExec`, and for v2
+data sources, `BatchScanExec` is replaced with `CometBatchScanExec`. In both cases, Comet replaces Spark's Parquet
+reader with a custom vectorized Parquet reader. This is similar to Spark's vectorized Parquet reader used by the v2
+Parquet data source but leverages native code for decoding Parquet row groups directly into Arrow format.
-When the V2 data source API is being used, `BatchScanExec` is replaced with `CometBatchScanExec`.
+Comet only supports a subset of data types and will fall back to Spark's scan if unsupported types
+exist. Comet can still accelerate the rest of the query execution in this case because `CometSparkToColumnarExec` will
+convert the output from Spark's scan to Arrow arrays. Note that both `spark.comet.exec.enabled=true` and
+`spark.comet.convert.parquet.enabled=true` must be set to enable this conversion.
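
Both settings can be passed at launch time in the same way as the plugin setting itself:

```
--conf spark.comet.exec.enabled=true
--conf spark.comet.convert.parquet.enabled=true
```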
-## CometExecRule
+Refer to the [Supported Spark Data Types](https://datafusion.apache.org/comet/user-guide/datatypes.html) section
+in the user guide to see a list of currently supported data types.
-`CometExecRule` attempts to transform a Spark physical plan into a Comet plan. This rule is executed against
-individual query stages when they are being prepared for execution.
+## CometExecRule
-This rule traverses bottom-up from the original Spark plan and attempts to replace each node with a Comet equivalent.
+This rule traverses bottom-up from the original Spark plan and attempts to replace each operator with a Comet equivalent.
For example, a `ProjectExec` will be replaced by `CometProjectExec`.
When replacing a node, various checks are performed to determine if Comet can support the operator and its expressions.
@@ -51,9 +62,27 @@ Comet does not support partially replacing subsets of the plan within a query st
transitions to convert between row-based and columnar data between Spark operators and Comet operators and the overhead
of this could outweigh the benefits of running parts of the query stage natively in Comet.
-Once the plan has been transformed, it is serialized into Comet protocol buffer format by the `QueryPlanSerde` class
-and this serialized plan is passed into the native code by `CometExecIterator`.
+## Query Execution
+
+Once the plan has been transformed, any consecutive Comet operators are combined into a `CometNativeExec` which contains
+a serialized version of the plan (the serialization code can be found in `QueryPlanSerde`). When this operator is
+executed, the serialized plan is passed to the native code when calling `Native.createPlan`.
In the native code there is a `PhysicalPlanner` struct (in `planner.rs`) which converts the serialized plan into an
-Apache DataFusion physical plan. In some cases, Comet provides specialized physical operators and expressions to
+Apache DataFusion `ExecutionPlan`. In some cases, Comet provides specialized physical operators and expressions to
override the DataFusion versions to ensure compatibility with Apache Spark.
+
+`CometExecIterator` will invoke `Native.executePlan` to pull the next batch from the native plan. This is repeated
+until no more batches are available (meaning that all data has been processed by the native plan).
+
+The leaf nodes in the physical plan are always `ScanExec` and these operators consume batches of Arrow data that were
+prepared before the plan is executed. When `CometExecIterator` invokes `Native.executePlan` it passes the memory
+addresses of these Arrow arrays to the native code.
+
+
+
+## End to End Flow
+
+The following diagram shows the end-to-end flow.
+
+
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]