andygrove opened a new pull request, #46: URL: https://github.com/apache/datafusion-java/pull/46
## Which issue does this PR close?

No tracking issue yet; happy to file one if useful. This implements the "Java UDFs" item that was on the (now-removed) project-status checklist.

## Rationale for this change

Users can already drive DataFusion via SQL and the DataFrame API; what's missing is a path to register a Java-implemented function and call it like any built-in. v1 covers scalar UDFs only: exact signatures, declared argument and return types, three volatilities. Aggregate / window / table UDFs are deferred.

## What changes are included in this PR?

**Public Java API**

- `ScalarUdf`: a `@FunctionalInterface` with one method, `FieldVector evaluate(BufferAllocator allocator, List<FieldVector> args)`
- `Volatility` enum (`IMMUTABLE` / `STABLE` / `VOLATILE`)
- `SessionContext.registerUdf(name, udf, returnType, argTypes, volatility)`

**Internals**

- `org.apache.datafusion.internal.JniBridge`: per-call static trampoline that imports the input columns, calls user code, validates (non-null + row count), and exports the result via the Arrow C Data Interface
- `native/src/udf.rs`: `JavaScalarUdf` implementing DataFusion's `ScalarUDFImpl`. Holds `GlobalRef`s to the user instance + bridge class and a cached `JStaticMethodID`; constructs an FFI struct-array view of the args, attaches the current thread to JNI, calls the bridge, translates any pending Java exception into a `DataFusionError::Execution`, imports the result via `arrow::ffi::from_ffi`, and validates the type matches the declared return type
- `JNI_OnLoad` in `native/src/lib.rs` caches the `JavaVM*` in a `OnceLock` so DataFusion's worker threads can attach

**Refinements from the design discussion**

- Args ride a single FFI struct pair (`FFI_ArrowArray` + `FFI_ArrowSchema` for a struct-array view of the columns), not an `ArrowArrayStream`: v1 always sees one batch per invoke and the simpler ABI avoids streaming overhead.
- The per-call allocator is a single shared static `RootAllocator` on `JniBridge` rather than a fresh-per-call one; closing a per-call allocator while an FFI release callback still holds buffer references would throw.
- `StructArray::try_new_with_length(... , number_rows)` is used for the args export so a zero-argument UDF doesn't panic.
- `number_rows` is `usize::try_into::<i32>` checked before crossing JNI; a batch larger than `i32::MAX` would otherwise silently truncate and miscompare with Java's row-count check.

**Docs + example**

- `docs/source/user-guide/scalar-udf.md` (linked from the user-guide toctree)
- `examples/src/main/java/org/apache/datafusion/examples/AddOneExample.java`

## How are these changes tested?

`make test` from a clean checkout. The new `ScalarUdfTest` has 12 tests covering:

- **Happy paths:** `add_one(Int32)`, `concat(Utf8, Utf8)`, `square(Float64)`, repeated invocations in one session, a 100-row VALUES scan
- **Contract violations:** UDF returning null, wrong row count, wrong type, and a UDF that throws `IllegalArgumentException`; all surface as `RuntimeException`s with the expected message substrings (class name + user message preserved)
- **Lifecycle:** two UDFs registered in the same session, register-after-close in a new session
- **Volatility:** all three `Volatility` values round-trip through registration and execute correctly
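
For reviewers who want a feel for the contract described above, here is a rough, dependency-free sketch of the bridge-side checks (non-null result, matching row count, exception class name and message preserved) together with the checked row-count narrowing. It is not the code in this PR: `ColumnUdf`, `invokeChecked`, `checkedRowCount`, and the error strings are hypothetical stand-ins, and plain `Integer[]` columns replace Arrow `FieldVector`s so the snippet runs without Arrow on the classpath. `Math.toIntExact` plays the role that the checked `usize` to `i32` conversion plays on the Rust side.

```java
import java.util.Arrays;
import java.util.List;

final class UdfContractSketch {

    /** Hypothetical stand-in for ScalarUdf: Integer[] columns replace Arrow FieldVectors. */
    @FunctionalInterface
    interface ColumnUdf {
        Integer[] evaluate(List<Integer[]> args);
    }

    /** Checked narrowing: fails loudly instead of wrapping around like an (int) cast. */
    static int checkedRowCount(long numRows) {
        return Math.toIntExact(numRows); // throws ArithmeticException on overflow
    }

    /**
     * Calls the UDF and validates the result. The row count is passed explicitly
     * because a zero-argument UDF has no input column to infer it from.
     */
    static Integer[] invokeChecked(ColumnUdf udf, List<Integer[]> args, long numRows) {
        int expected = checkedRowCount(numRows);
        Integer[] result;
        try {
            result = udf.evaluate(args);
        } catch (RuntimeException e) {
            // Keep the original class name and user message visible to the caller.
            throw new RuntimeException(e.getClass().getName() + ": " + e.getMessage(), e);
        }
        if (result == null) {
            throw new RuntimeException("UDF returned null");
        }
        if (result.length != expected) {
            throw new RuntimeException(
                "UDF returned " + result.length + " rows, expected " + expected);
        }
        return result;
    }

    /** add_one over a nullable Int32-like column: null in, null out. */
    static Integer[] addOne(List<Integer[]> args) {
        Integer[] in = args.get(0);
        Integer[] out = new Integer[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (in[i] == null) ? null : in[i] + 1;
        }
        return out;
    }

    public static void main(String[] argv) {
        // The explicit <Integer[]> keeps List.of from spreading the array as varargs.
        List<Integer[]> cols = List.<Integer[]>of(new Integer[] {1, null, 41});

        Integer[] out = invokeChecked(UdfContractSketch::addOne, cols, 3);
        System.out.println(Arrays.toString(out)); // prints [2, null, 42]

        try {
            invokeChecked(a -> new Integer[0], cols, 3); // wrong row count
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // prints UDF returned 0 rows, expected 3
        }
    }
}
```

For comparison, an unchecked cast such as `(int) (1L << 32)` evaluates to `0`, which is the kind of silent truncation that the checked conversion is meant to rule out before the count is compared on the Java side.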
