andygrove opened a new pull request, #46:
URL: https://github.com/apache/datafusion-java/pull/46

   ## Which issue does this PR close?
   
   No tracking issue yet — happy to file one if useful. This implements the 
"Java UDFs" item that was on the (now-removed) project-status checklist.
   
   ## Rationale for this change
   
   Users can already drive DataFusion via SQL and the DataFrame API; what's 
missing is a path to register a Java-implemented function and call it like any 
built-in. v1 covers scalar UDFs only — exact signatures, declared argument and 
return types, three volatilities. Aggregate / window / table UDFs are deferred.
   
   ## What changes are included in this PR?
   
   **Public Java API**
   - `ScalarUdf` `@FunctionalInterface` with one method `FieldVector 
evaluate(BufferAllocator allocator, List<FieldVector> args)`
   - `Volatility` enum (`IMMUTABLE` / `STABLE` / `VOLATILE`)
   - `SessionContext.registerUdf(name, udf, returnType, argTypes, volatility)`
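
For readers choosing a value for the `volatility` argument, here is a rough sketch of what the three levels mean. The enum below is a local stand-in that only mirrors the names listed above. The semantics in the comments are those of DataFusion's Rust `Volatility` type and are assumed to carry over to the Java enum. `VolatilityGuide.choose` is a hypothetical helper for illustration, not part of this PR.

```java
// Local stand-in mirroring the three names in the PR; the real enum may differ.
enum Volatility {
    /** Same input always gives the same output, so the planner may fold constant calls. */
    IMMUTABLE,
    /** Same input gives the same output within one query, but may differ across queries (like now()). */
    STABLE,
    /** Output may change from one evaluation to the next (like random()), so it cannot be computed once and reused. */
    VOLATILE
}

final class VolatilityGuide {
    // Hypothetical helper: pick the strongest guarantee the function can honestly make.
    static Volatility choose(boolean dependsOnQueryState, boolean nonDeterministic) {
        if (nonDeterministic) {
            return Volatility.VOLATILE;
        }
        if (dependsOnQueryState) {
            return Volatility.STABLE;
        }
        return Volatility.IMMUTABLE;
    }

    public static void main(String[] args) {
        // A pure arithmetic UDF such as add_one makes the strongest guarantee.
        System.out.println(choose(false, false));
    }
}
```

Declaring a stronger guarantee than the function actually provides can lead to wrong results if the planner folds or reuses a call, so `VOLATILE` is the safe choice when in doubt.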
   
   **Internals**
   - `org.apache.datafusion.internal.JniBridge` — per-call static trampoline 
that imports the input columns, calls user code, validates (non-null + row 
count), and exports the result via the Arrow C Data Interface
   - `native/src/udf.rs` — `JavaScalarUdf` implementing DataFusion's 
`ScalarUDFImpl`. Holds `GlobalRef`s to the user instance + bridge class and a 
cached `JStaticMethodID`; constructs an FFI struct-array view of the args, 
attaches the current thread to JNI, calls the bridge, translates any pending 
Java exception into a `DataFusionError::Execution`, imports the result via 
`arrow::ffi::from_ffi`, and validates the type matches the declared return type
   - `JNI_OnLoad` in `native/src/lib.rs` caches the `JavaVM*` in a `OnceLock` 
so DataFusion's worker threads can attach
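
The validation step in the bridge (non-null result, matching row count) can be sketched roughly as follows. This is a simplified illustration, not the code in `JniBridge`: `Column` is a hypothetical stand-in for an Arrow vector that carries only a row count, and the exception type and messages are assumptions.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

final class BridgeValidationSketch {
    // Hypothetical stand-in for an Arrow vector; only the row count matters here.
    static final class Column {
        final int rowCount;

        Column(int rowCount) {
            this.rowCount = rowCount;
        }
    }

    // Calls the user function, then enforces the two contract checks named above.
    static Column invoke(Function<List<Column>, Column> udf, List<Column> args, int expectedRows) {
        Column result = udf.apply(args);
        if (result == null) {
            throw new IllegalStateException("UDF returned null");
        }
        if (result.rowCount != expectedRows) {
            throw new IllegalStateException(
                "UDF returned " + result.rowCount + " rows, expected " + expectedRows);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Column> input = Collections.singletonList(new Column(3));
        // A well-behaved UDF returns one output row per input row.
        Column ok = invoke(cols -> new Column(cols.get(0).rowCount), input, 3);
        System.out.println(ok.rowCount);
    }
}
```

Checking on the Java side before export means a contract violation is reported as an ordinary Java exception instead of surfacing later as a malformed array on the native side.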
   
   **Refinements from the design discussion**
   - Args ride a single FFI struct pair (`FFI_ArrowArray` + `FFI_ArrowSchema` 
for a struct-array view of the columns), not an `ArrowArrayStream` — v1 always 
sees one batch per invoke and the simpler ABI avoids streaming overhead.
   - The allocator passed to each call is a single shared static `RootAllocator` 
on `JniBridge` rather than a fresh one per call, because closing a per-call 
allocator while an FFI release callback still holds buffer references would throw.
   - `StructArray::try_new_with_length(... , number_rows)` is used for the args 
export so a zero-argument UDF doesn't panic.
   - `number_rows` is converted from `usize` to `i32` with a checked `try_into` 
before crossing JNI; a batch with more than `i32::MAX` rows would otherwise be 
silently truncated and then mismatch Java's row-count check.
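
The truncation hazard in the last bullet has a direct Java analogue, sketched below for illustration. The PR performs this check in Rust; `checkedRowCount` is a hypothetical helper name, and `Math.toIntExact` stands in for the checked `try_into`.

```java
final class RowCountNarrowing {
    // Narrow a 64-bit row count to int, failing loudly instead of wrapping.
    static int checkedRowCount(long numberRows) {
        try {
            return Math.toIntExact(numberRows);
        } catch (ArithmeticException e) {
            throw new IllegalArgumentException(
                "batch of " + numberRows + " rows exceeds Integer.MAX_VALUE", e);
        }
    }

    public static void main(String[] args) {
        long tooBig = (1L << 32) + 1; // 4,294,967,297 rows

        // An unchecked cast keeps only the low 32 bits, so this prints 1.
        System.out.println((int) tooBig);

        // The checked conversion passes small counts through unchanged.
        System.out.println(checkedRowCount(100));

        // The oversized count is rejected instead of being truncated.
        try {
            checkedRowCount(tooBig);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the unchecked cast, the receiving side would be told to expect 1 row while the batch actually holds billions, which is the mismatch the checked conversion rules out.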
   
   **Docs + example**
   - `docs/source/user-guide/scalar-udf.md` (linked from the user-guide toctree)
   - `examples/src/main/java/org/apache/datafusion/examples/AddOneExample.java`
   
   ## How are these changes tested?
   
   `make test` from a clean checkout. The new `ScalarUdfTest` has 12 tests 
covering:
   
   - **Happy paths:** `add_one(Int32)`, `concat(Utf8, Utf8)`, 
`square(Float64)`, repeated invocations in one session, a 100-row VALUES scan
   - **Contract violations:** UDF returning null, wrong row count, wrong type, 
and a UDF that throws `IllegalArgumentException` — all surface as 
`RuntimeException`s with the expected message substrings (class name + user 
message preserved)
   - **Lifecycle:** two UDFs registered in the same session, 
register-after-close in a new session
   - **Volatility:** all three `Volatility` values round-trip through 
registration and execute correctly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

