[I] Reduce boilerplate in native expression implementations [datafusion-comet]

via GitHub Mon, 22 Jun 2026 12:17:41 -0700


andygrove opened a new issue, #4708:
URL: https://github.com/apache/datafusion-comet/issues/4708


   ## Is your feature request related to a problem or challenge?
   
   The native expression implementations in `native/spark-expr` and 
`native/core` contain a large amount of mechanical boilerplate that is 
copy-pasted across most `PhysicalExpr` and `ScalarUDFImpl` implementations. An 
audit of roughly 30 `PhysicalExpr` impls and 21 `ScalarUDFImpl` impls found 
several recurring patterns that add up to an estimated 1,500 to 2,000 lines of 
code that carry no real information and that contributors must re-derive by 
hand for every new expression.
   
   The repeated patterns are:
   
   1. **Hand-written `impl Hash` / `impl PartialEq`** that simply recurse over 
every field. Around 24 files do this; many could be a plain `#[derive(Hash, 
PartialEq, Eq)]`. `string_funcs/contains.rs` already uses derive, so the 
codebase is just inconsistent. Example: `math_funcs/negative.rs:49-60`.
   
   2. **Mechanical `PhysicalExpr` trait methods**: `children()`, 
`with_new_children()`, the `fmt_sql()` to `Display::fmt` forwarder, and 
pass-through `data_type()` / `nullable()`. The `fmt_sql` forwarder alone is 
duplicated verbatim in ~24 files. Note that `fmt_sql` is a required method on 
the upstream trait, so this cannot be solved with a trait default and needs a 
Comet-side macro. Example: `math_funcs/negative.rs:189-251`.
   
   3. **`ScalarUDFImpl` scaffolding**: a struct with `signature` + `aliases`, a 
`Default`, a `new()`, then `name()` / `signature()` / `return_type()` (often a 
constant) and an `invoke_with_args()` that forwards to a free function. ~21 
files, ~50 lines each. Examples: `array_funcs/size.rs:74-90`, 
`datetime_funcs/hours.rs:58-101`.
   
   4. **Scalar vs array dispatch**: the `match ColumnarValue { Array(a) => .., 
Scalar(s) => .. }` wrap/unwrap appears in ~8 functions. Example: 
`json_funcs/json_array_length.rs:71-85`.
   
   ## Describe the solution you'd like
   
   Introduce lightweight framework helpers to remove this boilerplate, rolled 
out incrementally so each step is independently reviewable and 
behavior-preserving:
   
   - [ ] **Replace hand-written `Hash` / `PartialEq` with `#[derive]`** where 
the manual impl covers every field with no transformation (the safe subset, 
roughly 13 structs). Impls that intentionally exclude a field (`registry`, 
`query_context`, `task_context`) or hold a non-deriving field (`Regex`, 
`Mutex`, `AtomicI64`, `Signature`) stay hand-written.
   - [ ] **A `macro_rules!` for the mechanical `PhysicalExpr` methods** 
(`children`, `with_new_children`, `fmt_sql`, optional `data_type` / `nullable` 
pass-through) generated from a field list.
   - [ ] **A declarative macro for `ScalarUDFImpl`** that generates the struct, 
`Default`, `new()`, and the trivial trait methods, leaving the author to write 
only the invoke logic.
   - [ ] **A `scalar_array_dispatch` helper** for the array/scalar wrap-unwrap 
pattern.
   
   Each item should land as its own PR.
   
   ## Describe alternatives you've considered
   
   - A proc-macro crate (`spark-expr-macros`) instead of `macro_rules!`. This 
is more powerful for reconstructing structs in `with_new_children`, but adds a 
build dependency and complexity. Defer unless `macro_rules!` proves 
insufficient.
   - Pushing trait defaults upstream into DataFusion. Not viable for `fmt_sql` 
in the short term since the trait lives in our dependency.
   
   ## Additional context
   
   This is an umbrella tracking issue. Start with the derive cleanup since it 
is pure, low-risk, and requires no new abstraction.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Reduce boilerplate in native expression implementations [datafusion-comet]

Reply via email to