andygrove opened a new pull request, #4283: URL: https://github.com/apache/datafusion-comet/pull/4283
## Which issue does this PR close?

Closes #.

## Rationale for this change

Comet today supports user UDFs only through the JVM `CometUDF` path: a Scala/Java callback invoked over JNI for every batch. The user's `evaluate(Array[ValueVector])` body either loops in Scala or reaches for Arrow Java's compute kernels, both slower than the `arrow-rs` kernels Comet itself uses natively.

This PR (experimental, draft) adds a parallel path for **scalar UDFs in Rust**. The user implements a small trait against `arrow-rs`, builds their crate as a `cdylib`, and registers the resulting `.so` / `.dylib` from Scala. Comet loads the library inside the executor and dispatches to it directly during native execution, with no JVM round-trip per batch.

The cross-`.so` boundary uses the **Arrow C Data Interface** (`FFI_ArrowArray` / `FFI_ArrowSchema`), so user libraries are decoupled from Comet's `arrow-rs` and `datafusion` versions: the only stability contract is the SDK ABI version (currently `1`).

## What changes are included in this PR?

Three new pieces, plus narrow integration in existing Comet:

- **`comet-udf-sdk`**: public Rust crate. Defines `CometScalarUdf`, signature / type-tag / error types, an `export!` macro emitting versioned `extern "C"` entry points, and an optional `from_scalar_udf_impl` adapter behind the `datafusion-adapter` feature.
- **`comet-test-udfs`**: in-tree test cdylib exposing five UDFs (happy path, struct-typed, user error, panic, length mismatch) used by host and end-to-end tests.
- **`rust_udf` module** in `native/core`: `loader` (libloading + ABI check + descriptor parse), a process-wide `cache`, and a `RustUdfAdapter` that implements `ScalarUDFImpl`.
- **`RustUdfCall` proto** in `expr.proto` and a planner branch in `create_expr` that resolves the call against the cache and wraps the adapter as a `ScalarUDF`.
- **JNI bridge** (`CometRustUdfBridge` / `comet_rust_udf_bridge.rs`) for driver-side `validateLibrary` / `listUdfs`.
- **Scala API**: `CometRustUDF.register` / `registerAll`, `CometRustUdfRegistry`, typed exception classes.
- **`QueryPlanSerde` branch** that recognizes a `ScalaUDF` whose name is registered and emits `RustUdfCall` instead.
- **User guide** at `docs/source/user-guide/latest/custom-rust-udfs.md`.

Marked experimental: scope is intentionally scalar-only, dynamic-library loading only, no JVM fallback, and library distribution is the user's responsibility (Spark `--files` or pre-install). Aggregate / window / table-valued UDFs and richer nested-type signature mapping are deliberately deferred.

## How are these changes tested?

- **SDK unit tests** (`comet-udf-sdk`): 11 tests covering type-tag round-trip, IPC field encoding, error types, layout assertions for both `UdfError` and `UdfDescriptor`, the `EncodedSignature` builder, and the optional DataFusion adapter (signature derivation, scalar materialization, non-Exact rejection).
- **Native host tests** (`native/core/src/execution/rust_udf/`): 9 tests covering library load + ABI check, descriptor parse for primitive and struct-typed UDFs, process-wide cache identity, and four async tokio adapter tests that run UDFs end-to-end through DataFusion (happy path, user error, panic, length mismatch).
- **Driver-side Scala suite** (`CometRustUdfRegistrySuite`): 3 tests covering register / re-register / snapshot semantics on the driver registry.
- **End-to-end Spark suite** (`CometRustUdfSuite`): 6 tests pass, covering native execution of `add_one`, error / panic surfacing, missing-path failure, signature mismatch failure, and `registerAll` for primitive-typed UDFs. One test (`registerAll` over the struct-typed fixture) is currently cancelled: it hits a v1 limitation around mapping Arrow's `DataType::to_string` output for `Struct` to a form the Spark DDL parser accepts. This is documented in the user guide's Limitations section; the struct-typed UDF works via explicit `register` with declared types.
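For reviewers who want a mental model of what the "library load + ABI check" and "process-wide cache identity" host tests exercise, here is a minimal, std-only sketch. It is not the code in this PR: `LoadedLibrary`, `LoadError`, `check_abi`, `get_or_load`, and `HOST_ABI_VERSION` are hypothetical names, and the ABI version is passed in as an argument, whereas a real loader would read it from a symbol exported by the library it opened.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// ABI version this hypothetical host accepts (the PR states the SDK ABI is currently 1).
const HOST_ABI_VERSION: u32 = 1;

/// Stand-in for a loaded UDF library. A real loader would also keep the
/// dynamic-library handle and the parsed UDF descriptors alive here.
#[derive(Debug)]
#[allow(dead_code)]
struct LoadedLibrary {
    path: String,
    abi_version: u32,
}

/// Hypothetical error type for load failures.
#[derive(Debug, PartialEq)]
enum LoadError {
    AbiMismatch { expected: u32, found: u32 },
}

/// Reject libraries built against a different SDK ABI version.
fn check_abi(found: u32) -> Result<(), LoadError> {
    if found == HOST_ABI_VERSION {
        Ok(())
    } else {
        Err(LoadError::AbiMismatch {
            expected: HOST_ABI_VERSION,
            found,
        })
    }
}

/// Process-wide cache keyed by library path, initialized on first use.
fn cache() -> &'static Mutex<HashMap<String, Arc<LoadedLibrary>>> {
    static CACHE: OnceLock<Mutex<HashMap<String, Arc<LoadedLibrary>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Return the cached entry for `path`, or validate and insert a new one.
/// `reported_abi` simulates the version the library would report about itself.
fn get_or_load(path: &str, reported_abi: u32) -> Result<Arc<LoadedLibrary>, LoadError> {
    let mut guard = cache().lock().unwrap();
    if let Some(lib) = guard.get(path) {
        // Cache hit: hand back the same Arc so callers share one instance.
        return Ok(Arc::clone(lib));
    }
    // Cache miss: validate before anything is stored.
    check_abi(reported_abi)?;
    let lib = Arc::new(LoadedLibrary {
        path: path.to_string(),
        abi_version: reported_abi,
    });
    guard.insert(path.to_string(), Arc::clone(&lib));
    Ok(lib)
}
```

In this sketch, two calls to `get_or_load` with the same path return `Arc`s for which `Arc::ptr_eq` is true, which is the kind of identity a cache test can assert. A library reporting a different ABI version is rejected before it is cached.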
The end-to-end suite is gated on `-Dcomet.test.udfs.lib=<path to libcomet_test_udfs>`; the path is plumbed through scalatest's `systemProperties` in the root `pom.xml`. The Rust test crate's `core/build.rs` exposes the same path to native tests via the `COMET_TEST_UDFS_LIB` env var.
