schenksj opened a new pull request, #4339:
URL: https://github.com/apache/datafusion-comet/pull/4339
# feat(contrib): introduce contrib extension SPI
Branch: `comet-contrib-spi`
Base: `main`
Commits: 18 (one architectural cluster + four review-fix rounds + a
doc-completeness pass + the symmetric-distribution refactor)
## Summary
Adds the infrastructure for *contrib extensions* — self-contained modules that
ship as part of Comet's release but plug in via a stable SPI rather than being
hard-wired into core. There is no functional change to core: with no contrib
feature/profile enabled, the dispatch hooks are no-ops and
`registered_contrib_kinds()` is empty.
The first concrete contrib (`contrib/example/`) is a worked reference: Scala
extension
classes, a `CometScanRuleExtension` implementation, a Rust rlib registering a
`ContribOperatorPlanner`, a `META-INF/services/` entry, its own `.proto`
schema with
prost-build wiring, and unit tests covering registration, proto decode
round-trip, and
the error path. New contrib authors copy this directory layout.
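As a sketch of that layout (only the pieces named above — the Scala sources, the `META-INF/services/` entry, the deps-pom, the `.proto` plus prost-build wiring — are from this PR; the exact paths, in particular where the Rust crate sits inside `contrib/example/`, are illustrative):

```
contrib/example/
├── pom.xml                                  # <packaging>pom</packaging>: external deps only
├── src/main/scala/                          # JVM half: extension classes
├── src/main/resources/META-INF/services/    # ServiceLoader entry
└── native/                                  # Rust half (location illustrative)
    ├── Cargo.toml                           # rlib crate, depends on the SPI leaf crate
    ├── build.rs                             # prost-build wiring for the .proto schema
    └── src/lib.rs                           # #[ctor] registration + ContribOperatorPlanner impl
```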
This is the first of a two-PR sequence. **PR2** (a follow-up off `main`
after this
lands) will port Comet's existing delta-kernel-rs integration onto this SPI
and into
`contrib/delta/`. The SPI shape has been validated end-to-end against a real
Delta port
on a separate branch — see the testing section below.
## Distribution model
Both halves of a contrib are bundled into Comet's released artifacts at
build time when
their matching flags are enabled. Contribs are not independently
distributable — they
ship inside Comet's release.
- **JVM half** — Scala/Java sources under `contrib/<name>/src/main/scala/`,
  compiled INTO `comet-spark.jar` by activating `-Pcontrib-<name>` on
  `spark/pom.xml`. The contrib's `META-INF/services/` entries go along for the
  ride; ServiceLoader at runtime discovers them from inside `comet-spark.jar`
  itself. The contrib has a tiny `<packaging>pom</packaging>` Maven pom that
  exists solely to enumerate external deps (e.g., a Delta contrib's pom would
  carry `<dependency>io.delta:delta-spark</dependency>`).
- **Native half** — a Rust `rlib` crate (NOT `cdylib`) linked INTO `libcomet`
  via the matching `--features contrib-<name>` Cargo flag on `native/core`.
  The contrib's `#[ctor]` registers its operator planners during library load.
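A minimal sketch of what that registration path looks like. The SPI function and trait names here (`register_contrib_planner`, the registry shape) are assumptions, not the PR's actual identifiers, and the `#[ctor]` attribute (from the `ctor` crate) is shown as a comment so the sketch stays dependency-free:

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Stand-in for the SPI's planner trait.
trait ContribOperatorPlanner: Send + Sync {
    fn kind(&self) -> &'static str;
}

/// Stand-in for the SPI-owned registry (the PR uses ArcSwap; a Mutex is
/// enough for a sketch where writes happen only at load time).
static REGISTRY: OnceLock<Mutex<HashMap<&'static str, Box<dyn ContribOperatorPlanner>>>> =
    OnceLock::new();

fn register_contrib_planner(planner: Box<dyn ContribOperatorPlanner>) {
    let kind = planner.kind();
    REGISTRY
        .get_or_init(|| Mutex::new(HashMap::new()))
        .lock()
        .unwrap()
        .insert(kind, planner);
}

struct ExampleNoOpPlanner;

impl ContribOperatorPlanner for ExampleNoOpPlanner {
    fn kind(&self) -> &'static str {
        "example-no-op"
    }
}

// #[ctor]   // in the real rlib this attribute makes it run during library load
fn init_contrib_example() {
    register_contrib_planner(Box::new(ExampleNoOpPlanner));
}
```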
`mvn install -Pcontrib-example && cargo build --features contrib-example`
produces a
Comet build that includes the example contrib in both `comet-spark.jar` and
`libcomet`.
A vanilla `mvn install && cargo build` produces a build with zero contrib
surface.
The wire format between JVM and native uses a single generic envelope on the
operator
proto, `ContribOp { kind, payload }`. Core's planner dispatches by `kind`;
the contrib's
native crate registers planners against the same `kind` string the contrib's
JVM code
writes into the proto.
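The dispatch-by-`kind` contract can be sketched as follows; the names (`dispatch`, `Planner`, `PlanError`) are illustrative stand-ins, not the PR's real types, and `plan(ctx, payload, children)` is collapsed to a single function pointer:

```rust
use std::collections::HashMap;

/// The generic envelope on the operator proto: core never interprets payload.
struct ContribOp {
    kind: String,
    payload: Vec<u8>,
}

#[derive(Debug)]
enum PlanError {
    UnknownContribKind(String),
}

/// Stand-in for ContribOperatorPlanner::plan(ctx, payload, children).
type Planner = fn(payload: &[u8]) -> String;

/// Core's side of the contract: look up the planner registered against the
/// same `kind` string the contrib's JVM code wrote, or fail with a clear error.
fn dispatch(registry: &HashMap<&str, Planner>, op: &ContribOp) -> Result<String, PlanError> {
    match registry.get(op.kind.as_str()) {
        Some(plan) => Ok(plan(&op.payload)),
        None => Err(PlanError::UnknownContribKind(op.kind.clone())),
    }
}
```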
## Architecture
```
JVM                                          Native (single libcomet)

CometSparkSessionExtensions.apply()          #[cfg(feature = "contrib-example")]
│                                            extern crate comet_contrib_example;
├─ injectQueryStagePrepRule(                             │
│    CometScanRule._apply                                ▼
│    ├─ CometExtensionRegistry.load()        contrib's #[ctor] runs at lib load
│    │    (lazy, first invocation)                       │
│    │    discovers META-INF/services                    ▼
│    │    inside comet-spark.jar             comet_contrib_spi::registry (ArcSwap)
│    │                                         { "delta-scan"           → DeltaPlanner,
│    ├─ preTransform fold pass                   "example-no-op"        → NoOpPlanner,
│    │    (V1 only, gated on                     "example-constant-scan"→ ConstantScanPlanner }
│    │    COMET_NATIVE_SCAN_ENABLED)                     │
│    │                                                   │
│    └─ per-scan dispatch:                   core's `OpStruct::ContribOp` arm:
│         iterate registered exts              lookup_contrib_planner_by_kind(co.kind)
│         first match wins                       .plan(ctx, co.payload, children)
│                                                        │
└─ injectQueryStagePrepRule(                 ctx: ContribPlannerContext, gives the
     CometExecRule._apply                    contrib core's planner services:
     ├─ CometExtensionRegistry.load()          - session_ctx
     │    (lazy mirror of above)               - build_physical_expr
     │                                         - convert_spark_schema
     └─ (allExecs ∪                            - prepare_object_store
          CometExtensionRegistry               - build_parquet_datasource_exec
            .mergedSerdes)
         .get(op.getClass)
   )

                  ───── wire format ─────
                  OpStruct.contrib_op {
                    kind: "delta-scan",
                    payload: <contrib-private bytes,
                              decoded by the contrib's own
                              prost::Message>,
                    reserved 3 to 9;
                  }
```
## What's in this PR (18 commits)
### Foundational SPI (5 commits)
| Commit | What |
|---|---|
| `51eb0fff` | `ContribOp { kind, payload }` proto envelope; Rust planner
registry SPI |
| `f448693b` | Native `OpStruct::ContribOp` dispatcher arm in `planner.rs` |
| `f23500df` | Scala SPI: `CometScanRuleExtension`,
`CometOperatorSerdeExtension`, `CometExtensionRegistry` |
| `42234b96` | `CometScanRule` (V1 + V2) and `CometExecRule` consult the
registry |
| `8b694715` | `CometSparkSessionExtensions.apply` hooks the registry |
### Worked-reference contrib (3 commits)
| Commit | What |
|---|---|
| `d1553b55` | Rust half of `contrib/example/`: rlib crate, `#[ctor]`
registration; introduced `native/contrib-spi/` leaf crate to break a cyclic dep
|
| `5cb7099a` | JVM half of `contrib/example/`: Scala extension,
ServiceLoader entry, integration test |
| `8508ec50` | First version of the contributor guide |
### SPI shape refinements (2 commits)
| Commit | What |
|---|---|
| `e018076d` | Refinements from a Delta-port confidence check:
`preTransform` tree-level hook on `CometScanRuleExtension`, proto layer in
`contrib/example/`, class-keyed dispatch convention documented |
| `14e49448` | `ContribPlannerContext` trait + `ParquetDatasourceParams`
argument bundle (SPI gap #4 — see "Delta-port confidence check" below) |
### Review-fix rounds (4 commits)
| Commit | What |
|---|---|
| `8930b698` | First review pass: test isolation, V2-asymmetry doc,
`#[non_exhaustive]` markers, `preTransform` corruption guard, `#[ctor]` panic
safety, drop `contrib-example` from default features, gate registry load
lazily, cache mergedSerdes, multi-extension dispatch semantics,
`prepare_object_store` returns `Path`, gate preTransform on
`COMET_NATIVE_SCAN_ENABLED`, 16 MiB payload cap, "none discovered" diagnostic |
| `68fff43f` | Second pass: `CometExecRule` self-loads, stale docstring,
dead doc refs, `Display` wildcard fix, corruption-guard rewrite (identity
check, class-changing replacements caught), `ContribOp` size guard ordering,
encryption-asymmetry positional-arg test, owned `String` for session_timezone,
production-canary `#[cfg]` test |
| `e4e6e6c6` | Third pass: `IdentityHashMap` survivors set, `synchronized`
load() publication order, empty/whitespace kind rejection, doc accuracy, scope
notes |
| `6652963c` | Fourth pass: identity survivors + cost comment,
`resetForTesting` synchronized, doc trims, whitespace kind rejection,
`#[cfg(not(any(...)))]` form, public `resetForTesting`, wildcard-arm comment,
`Display` debug repr, `ContribOp` proto `reserved` block, payload cap doc |
### Contributor guide completeness (2 commits)
| Commit | What |
|---|---|
| `91c40e0a` | Comprehensive rewrite: prerequisites, JVM-side proto
guidance, full `plan()` body walkthrough using every `ContribPlannerContext`
method, `CometOperatorSerde[T]` contract, diagnostics story,
multi-Spark-version, end-to-end testing recipe, Cargo-canary maintenance note |
| `2c46552c` | Second-pass review fixes |
### Architectural pivot to symmetric distribution (2 commits)
| Commit | What |
|---|---|
| `cf5253ed` | Bundle JVM half INTO `comet-spark.jar` via Maven profile.
Mirrors the native side's "Cargo feature pulls rlib into libcomet" model. No
more separate contrib JARs. ~70 lines of protobuf-shading boilerplate deleted
from the contributor guide (shading now handled automatically by
`comet-spark`'s existing shade execution). |
| `c7656fcc` | Deps-only pom per contrib so contribs like Delta can pull in
external Maven deps (e.g. `delta-spark`) without recreating a Maven reactor
cycle. Registry primitive: `RwLock<HashMap>` → `ArcSwap<HashMap>` for lock-free
reads on the dispatch hot path. |
## Notable design decisions
### Why bundle into one artifact?
Previously the contrib's JVM half shipped as a separate Maven JAR users would
`--packages` or `--jars` onto their classpath. That asymmetry made no sense
given the
native side already requires a Comet rebuild (Cargo feature flag) for the
contrib to
work — pretending the JVM half was distributable independently was a
fiction. The
symmetric design has one artifact per side, both varying based on which
contribs were
enabled at Comet build time. This eliminated the protobuf-shading recipe that
externally-published contrib JARs needed (which was the single biggest
source of doc
complexity).
### Why a separate `comet-contrib-spi` Rust crate?
Cycle break. Core would need to depend on contribs (to link them); contribs
need core's
trait types (`ContribOperatorPlanner`). Solution: a leaf crate both depend
on, with
nothing depending back on core from a contrib.
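In Cargo terms the break looks roughly like this. The crate names match the ones used in the testing section below, but the paths and the exact feature wiring are illustrative:

```toml
# native/core/Cargo.toml (sketch)
[dependencies]
comet-contrib-spi = { path = "../contrib-spi" }
comet-contrib-example = { path = "../../contrib/example/native", optional = true }

[features]
contrib-example = ["dep:comet-contrib-example"]

# The contrib's own Cargo.toml depends only on the leaf crate,
# never back on core, so no cycle can form:
# [dependencies]
# comet-contrib-spi = { path = "../../../native/contrib-spi" }
```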
### Why `ContribPlannerContext`?
Surfaced by the Delta-port confidence check. A real file-scan contrib needs
five
core-side facilities: `init_datasource_exec`,
`prepare_object_store_with_configs`,
`convert_spark_types_to_arrow_schema`, expression-planning (`create_expr`),
and a
`SessionContext` handle. Exposing these as a trait core implements (and
contribs use
via `&dyn ContribPlannerContext`) avoids a back-dep on core while giving
contribs
everything they need to compose with Comet's tuned parquet path.
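A shape sketch of that trait: core implements it once and hands contribs a `&dyn ContribPlannerContext`, so the contrib crate never names core's types. Everything below — the stand-in types, the stub implementation, and the exact method signatures — is illustrative; the real methods take DataFusion/Comet types (`SessionContext`, `ExecutionPlan`, proto expression messages):

```rust
// Stand-ins for the real DataFusion / Arrow types.
struct SessionCtx;
struct ArrowSchema(Vec<String>);
struct ExecPlan(String);

/// Core implements this; contribs consume it via &dyn, avoiding a back-dep.
trait ContribPlannerContext {
    fn session_ctx(&self) -> &SessionCtx;
    fn convert_spark_types_to_arrow_schema(&self, spark_types: &[&str]) -> ArrowSchema;
    fn create_expr(&self, expr_proto: &[u8]) -> String;
    fn prepare_object_store_with_configs(&self, url: &str) -> String; // returns a Path
    fn init_datasource_exec(&self, schema: ArrowSchema, files: &[String]) -> ExecPlan;
}

/// Toy core-side implementation, just enough to exercise the shape.
struct StubCtx;

impl ContribPlannerContext for StubCtx {
    fn session_ctx(&self) -> &SessionCtx {
        &SessionCtx
    }
    fn convert_spark_types_to_arrow_schema(&self, spark_types: &[&str]) -> ArrowSchema {
        ArrowSchema(spark_types.iter().map(|s| s.to_string()).collect())
    }
    fn create_expr(&self, _expr_proto: &[u8]) -> String {
        "expr".to_string()
    }
    fn prepare_object_store_with_configs(&self, url: &str) -> String {
        url.to_string()
    }
    fn init_datasource_exec(&self, schema: ArrowSchema, files: &[String]) -> ExecPlan {
        ExecPlan(format!("parquet-scan[{} cols, {} files]", schema.0.len(), files.len()))
    }
}

/// A contrib planner composes these facilities to reuse core's parquet path.
fn plan_contrib_scan(ctx: &dyn ContribPlannerContext, files: &[String]) -> ExecPlan {
    let schema = ctx.convert_spark_types_to_arrow_schema(&["long", "string"]);
    ctx.init_datasource_exec(schema, files)
}
```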
### Why `ArcSwap` instead of `RwLock` for the registry?
Reads are on the dispatch hot path; writes happen exclusively during library
init from
sequential `#[ctor]`s. The init-once / read-many access pattern is what
`ArcSwap` is
designed for. The original `RwLock<HashMap>` would have introduced
reader-writer
contention with no actual concurrent-write workload to justify it.
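The access pattern can be illustrated with a stdlib-only sketch. The PR uses the `arc_swap` crate; this sketch gets the same lock-free reads by publishing a leaked map pointer and deliberately leaking replaced maps — sound here only because all writes happen during sequential `#[ctor]`s at library init and the maps are tiny:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicPtr, Ordering};

// Leaked-pointer registry: readers do one atomic load, never take a lock.
static REGISTRY: AtomicPtr<HashMap<String, &'static str>> =
    AtomicPtr::new(std::ptr::null_mut());

/// Called only from sequential #[ctor]s at load time, so no write races.
fn register(kind: &str, planner: &'static str) {
    let old = REGISTRY.load(Ordering::Acquire);
    // Copy-on-write: build the successor map, then publish it in one store.
    let mut next = if old.is_null() {
        HashMap::new()
    } else {
        unsafe { (*old).clone() }
    };
    next.insert(kind.to_string(), planner);
    REGISTRY.store(Box::into_raw(Box::new(next)), Ordering::Release);
    // `old` is intentionally leaked: no reader can be left holding a freed map.
    // (ArcSwap reclaims old maps safely; that is what the crate buys you.)
}

/// Dispatch hot path: a single atomic load, no reader-writer contention.
fn lookup(kind: &str) -> Option<&'static str> {
    let map = REGISTRY.load(Ordering::Acquire);
    if map.is_null() {
        None
    } else {
        unsafe { (*map).get(kind).copied() }
    }
}
```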
### Why bundling-via-source-injection rather than bundling-via-shaded-JAR?
A separate Maven module per contrib whose JAR gets shaded into
`comet-spark.jar`
would form a Maven reactor cycle (contrib's pom depends on `comet-spark` for
SPI
types; `comet-spark`'s contrib profile depends on contrib's JAR).
Source-injection
avoids the cycle: the contrib's Scala sources are compiled INSIDE
`comet-spark`'s
own compilation pass (via `build-helper-maven-plugin`'s `add-source` goal);
no
separate compile, no per-contrib JAR, no cycle. External Maven deps
(`delta-spark`,
etc.) flow through the contrib's separate `<packaging>pom</packaging>`
artifact.
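The profile shape might look like the following sketch; the execution ids, paths, and phase binding are illustrative, and the real profile in `spark/pom.xml` may differ (only the `build-helper-maven-plugin` `add-source` mechanism is taken from the text above):

```xml
<profile>
  <id>contrib-example</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>build-helper-maven-plugin</artifactId>
        <executions>
          <execution>
            <id>add-contrib-example-sources</id>
            <phase>generate-sources</phase>
            <goals>
              <goal>add-source</goal>
            </goals>
            <configuration>
              <sources>
                <!-- contrib sources join comet-spark's own compilation pass -->
                <source>${project.basedir}/../contrib/example/src/main/scala</source>
              </sources>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```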
## The Delta-port confidence check
Before opening this PR, I ported Comet's existing Delta integration (~3,200
lines on
the `delta-kernel-phase-1` branch) onto this SPI as a confidence check. The
port
itself is not committed here — its purpose was to surface SPI gaps before
review.
Four gaps were found, all addressed in this PR:
1. **No tree-level pre-pass hook** → added
`CometScanRuleExtension.preTransform`.
2. **No reference for the proto layer in `contrib/example/`** → added a
trivial
`ExampleConstantScan` message, `build.rs`, prost-build wiring, and tests.
3. **Class-keyed serde dispatch convention undocumented** → documented in the
contributor guide.
4. **`ContribOperatorPlanner::plan` lacked access to core's parquet /
expression
machinery** → introduced `ContribPlannerContext` trait +
`ParquetDatasourceParams`
bundle in commit `14e49448`. The full ~150-line Delta dispatcher body
compiled
clean against the new SPI surface; every trait method was exercised
end-to-end.
Full findings live in `PR1-delta-port-findings.md` (not committed;
review-prep
artifact).
Net conclusion: the SPI is the right shape for a real consumer. PR2's Delta
port can
proceed mechanically with no further SPI surprises expected.
## Review iterations
The branch went through four independent clean-context code review passes
(general-purpose
subagent reviews launched fresh on each iteration's HEAD). Each pass
surfaced a different
class of issue:
- **Pass 1** (review of `14e49448`, fixed by `8930b698`): 6 blockers + 10
important +
nits. Test isolation, SemVer markers, V2 asymmetry, `#[ctor]` panic
safety, default
feature leakage, ServiceLoader gating, more.
- **Pass 2** (review of `8930b698`, fixed by `68fff43f`): 3 regressions + 4
polish
items + 10 new findings. Stale class docstring, dead doc refs, `Display`
info-loss, corruption-guard class-change bypass.
- **Pass 3** (review of `68fff43f`, fixed by `e4e6e6c6`): 2 regressions + 8
polish.
Survivors-set false positive, `load()` publication race.
- **Pass 4** (review of `e4e6e6c6`, fixed by `6652963c`): no blockers found;
6 polish.
Cost-comment accuracy, identity-map upgrade, payload-guard message
ordering.
A separate completeness validation of the contributor guide (the doc-only
audit pass)
identified ~13 gaps a real contrib author would hit. Closed in `91c40e0a` and
`2c46552c`.
## Build verification
- `cargo check` (default features): green.
- `cargo check --no-default-features`: green; zero contrib surface in the
resulting
`libcomet`.
- `cargo test -p comet-contrib-spi`: 7 tests pass (registry round-trip,
scoped
registration, kinds snapshot, params constructor/setters, `ContribError`
Display
preservation).
- `cargo test -p datafusion-comet --lib -- execution::planner::contrib`: 5
tests pass,
including the production-canary that asserts default cdylib has no
registered
contribs and the encryption-asymmetry test that catches positional-arg
swaps in
`init_datasource_exec`.
- `cargo test -p comet-contrib-example`: 4 tests pass (ctor registration,
decode +
build, zero rows, bad payload).
- Maven `-Pcontrib-example` activates the profile cleanly;
`comet-contrib-example-deps`
builds first, then `comet-spark` with the contrib's sources injected. No
reactor
cycle.
## What this PR is NOT
- It does NOT migrate Delta / kernel-rs to the new SPI. That's PR2.
- It does NOT exercise `CometOperatorSerdeExtension` from the example
contrib (the
example only demonstrates `CometScanRuleExtension`; the trait surface is
documented
and validated against the Delta-port confidence check).
- The Maven `BanDuplicateClasses` enforcer is no longer overridden
per-contrib
(contribs are no longer separate Maven modules); the rule applies to
`comet-spark`
itself as before.
## How to review
Suggested reading order:
1. `docs/source/contributor-guide/contrib-extensions.md` — author-facing
guide; doubles
as the architectural overview.
2. `native/proto/src/proto/operator.proto` — the `ContribOp` envelope
(small, look for
the new variant on `OpStruct`).
3. `native/contrib-spi/src/lib.rs` — the leaf SPI crate (~370 lines incl.
tests).
4. `spark/src/main/scala/org/apache/comet/spi/` — three small files defining
the JVM
SPI.
5. `native/core/src/execution/planner.rs` — the `OpStruct::ContribOp`
dispatcher arm
(~lines 1960–2020).
6. `native/core/src/execution/planner/contrib.rs` — `CorePlannerContext`
adapter that
exposes core's parquet/expression infrastructure to contribs through the
SPI trait.
7. `spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala` and
`CometExecRule.scala` — the integration hooks. Each is a small insertion
at the
top of `_apply`; the `preTransform` fold runs once per plan.
8. `contrib/example/` — the worked reference (deps-pom + Scala source +
Cargo crate).
9. `spark/pom.xml`'s `contrib-example` profile — the template for wiring a
new
contrib into the build.
## Risks / follow-ups (tracked for PR2)
- **`CometOperatorSerdeExtension` not yet exercised by a contrib.** The
example
contrib only demonstrates `CometScanRuleExtension`. PR2's Delta port will
exercise
the operator serde path via `CometDeltaNativeScanExec`'s dedicated class.
- **Native test runner needs `libjvm` on the dyld path.** Running
`cargo test -p datafusion-comet --lib` on macOS requires
`DYLD_LIBRARY_PATH=$JAVA_HOME/lib/server` (only relevant when the test
binary
transitively links against the JNI crate). Preexisting on `main` — not
introduced
by this PR — but worth documenting.
- **CI matrix should add a row that enables `-Pcontrib-example` (Maven) and
  `--features contrib-example` (Cargo)** so the bundling path is exercised in
  CI on every PR. Today only the slim build is in CI.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]