andygrove opened a new pull request, #4373: URL: https://github.com/apache/datafusion-comet/pull/4373
## Summary Adds a JVM UDF fallback so that `date_format` always runs inside Comet: - **Native (DataFusion `to_char`)** when the format is a literal in the strftime-mappable whitelist **and** the session timezone is UTC, **or** when `spark.comet.expression.DateFormatClass.allowIncompatible=true`. - **JVM UDF (`org.apache.comet.udf.DateFormatUDF`)** in every other case: non-UTC timezone, non-literal format, or a format outside the whitelist. The UDF wraps Spark's own `DateFormatClass` and is invoked through the existing `JvmScalarUdf` framework, so results match Spark by construction. Unlike the regexp engine config, no new user-visible knob is introduced — the JVM UDF is a transparent correctness fallback rather than an opt-in engine. ### Behavior change Cases that previously fell back to Spark (and broke up the Comet pipeline with row-group transitions) now stay inside the Comet operator. The native path is unchanged for the UTC + whitelisted-format case. ### Implementation notes - `DateFormatUDF` caches one `DateFormatClass` instance per `(format, timezone)`. Constructing with a literal format makes Spark's `formatterOption` lazy-val resolve to a reusable formatter, so the per-row work is just `eval(InternalRow(micros))`. - For scalar (literal-folded) formats — the common case — the cache lookup is hoisted out of the per-row loop to eliminate Tuple2 + HashMap.get allocations on the hot path. - The serde now always returns `Compatible` from `getSupportLevel`; the native-vs-UDF decision is made inside `convert`. ## Test plan - [x] `CometTemporalExpressionSuite` (Spark 3.5): all 27 tests pass - [x] Three existing fallback-reason tests rewritten to assert the JVM UDF path now runs inside Comet (`checkSparkAnswerAndOperator`) - [x] `date_format - timestamp_ntz input` now runs `checkSparkAnswerAndOperator` for **all** timezones (previously only UTC) - [x] `allowIncompatible` test fixed: corrected config key (`expression`, not `expr`) and switched from answer-comparison to operator-only assertion since the native path may legitimately diverge from Spark for non-UTC timezones - [ ] Spark 3.4 / 4.0 profile sanity - [ ] Benchmark UTC native path vs JVM UDF path on a representative workload ## Notes for reviewers This is a draft pending the items above and any feedback on the routing strategy in `CometDateFormat.convert`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
