andygrove opened a new issue, #4708:
URL: https://github.com/apache/datafusion-comet/issues/4708
## Is your feature request related to a problem or challenge?
The native expression implementations in `native/spark-expr` and
`native/core` contain a large amount of mechanical boilerplate that is
copy-pasted across most `PhysicalExpr` and `ScalarUDFImpl` implementations. An
audit of roughly 30 `PhysicalExpr` impls and 21 `ScalarUDFImpl` impls found
several recurring patterns that add up to an estimated 1,500 to 2,000 lines of
code that carry no real information and that contributors must re-derive by
hand for every new expression.
The repeated patterns are:
1. **Hand-written `impl Hash` / `impl PartialEq`** that simply recurse over
every field. Around 24 files do this; many could be a plain `#[derive(Hash,
PartialEq, Eq)]`. `string_funcs/contains.rs` already uses derive, so the
codebase is just inconsistent. Example: `math_funcs/negative.rs:49-60`.
2. **Mechanical `PhysicalExpr` trait methods**: `children()`,
`with_new_children()`, the `fmt_sql()` to `Display::fmt` forwarder, and
pass-through `data_type()` / `nullable()`. The `fmt_sql` forwarder alone is
duplicated verbatim in ~24 files. Note that `fmt_sql` is a required method on
the upstream trait, so this cannot be solved with a trait default and needs a
Comet-side macro. Example: `math_funcs/negative.rs:189-251`.
3. **`ScalarUDFImpl` scaffolding**: a struct with `signature` + `aliases`, a
`Default`, a `new()`, then `name()` / `signature()` / `return_type()` (often a
constant) and an `invoke_with_args()` that forwards to a free function. ~21
files, ~50 lines each. Examples: `array_funcs/size.rs:74-90`,
`datetime_funcs/hours.rs:58-101`.
4. **Scalar vs array dispatch**: the `match ColumnarValue { Array(a) => ..,
Scalar(s) => .. }` wrap/unwrap appears in ~8 functions. Example:
`json_funcs/json_array_length.rs:71-85`.
## Describe the solution you'd like
Introduce lightweight framework helpers to remove this boilerplate, rolled
out incrementally so each step is independently reviewable and
behavior-preserving:
- [ ] **Replace hand-written `Hash` / `PartialEq` with `#[derive]`** where
the manual impl covers every field with no transformation (the safe subset,
roughly 13 structs). Impls that intentionally exclude a field (`registry`,
`query_context`, `task_context`) or hold a non-deriving field (`Regex`,
`Mutex`, `AtomicI64`, `Signature`) stay hand-written.
- [ ] **A `macro_rules!` for the mechanical `PhysicalExpr` methods**
(`children`, `with_new_children`, `fmt_sql`, optional `data_type` / `nullable`
pass-through) generated from a field list.
- [ ] **A declarative macro for `ScalarUDFImpl`** that generates the struct,
`Default`, `new()`, and the trivial trait methods, leaving the author to write
only the invoke logic.
- [ ] **A `scalar_array_dispatch` helper** for the array/scalar wrap-unwrap
pattern.
Each item should land as its own PR.
## Describe alternatives you've considered
- A proc-macro crate (`spark-expr-macros`) instead of `macro_rules!`. This
is more powerful for reconstructing structs in `with_new_children`, but adds a
build dependency and complexity. Defer unless `macro_rules!` proves
insufficient.
- Pushing trait defaults upstream into DataFusion. Not viable for `fmt_sql`
in the short term since the trait lives in our dependency.
## Additional context
This is an umbrella tracking issue. Start with the derive cleanup since it
is pure, low-risk, and requires no new abstraction.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]