[PR] feat: JVM UDF fallback for date_format [datafusion-comet]

via GitHub Wed, 20 May 2026 10:38:03 -0700


andygrove opened a new pull request, #4373:
URL: https://github.com/apache/datafusion-comet/pull/4373


   ## Summary
   
   Adds a JVM UDF fallback so that `date_format` always runs inside Comet:
   
   - **Native (DataFusion `to_char`)** when the format is a literal in the 
strftime-mappable whitelist **and** the session timezone is UTC, **or** when 
`spark.comet.expression.DateFormatClass.allowIncompatible=true`.
   - **JVM UDF (`org.apache.comet.udf.DateFormatUDF`)** in every other case: 
non-UTC timezone, non-literal format, or a format outside the whitelist. The 
UDF wraps Spark's own `DateFormatClass` and is invoked through the existing 
`JvmScalarUdf` framework, so results match Spark by construction.
   
   Unlike the regexp engine config, no new user-visible knob is introduced — 
the JVM UDF is a transparent correctness fallback rather than an opt-in engine.
   
   ### Behavior change
   
   Cases that previously fell back to Spark (and broke up the Comet pipeline 
with row-group transitions) now stay inside the Comet operator. The native path 
is unchanged for the UTC + whitelisted-format case.
   
   ### Implementation notes
   
   - `DateFormatUDF` caches one `DateFormatClass` instance per `(format, 
timezone)`. Constructing with a literal format makes Spark's `formatterOption` 
lazy-val resolve to a reusable formatter, so the per-row work is just 
`eval(InternalRow(micros))`.
   - For scalar (literal-folded) formats — the common case — the cache lookup 
is hoisted out of the per-row loop to eliminate Tuple2 + HashMap.get 
allocations on the hot path.
   - The serde now always returns `Compatible` from `getSupportLevel`; the 
native-vs-UDF decision is made inside `convert`.
   
   ## Test plan
   
   - [x] `CometTemporalExpressionSuite` (Spark 3.5): all 27 tests pass
   - [x] Three existing fallback-reason tests rewritten to assert the JVM UDF 
path now runs inside Comet (`checkSparkAnswerAndOperator`)
   - [x] `date_format - timestamp_ntz input` now runs 
`checkSparkAnswerAndOperator` for **all** timezones (previously only UTC)
   - [x] `allowIncompatible` test fixed: corrected config key (`expression`, 
not `expr`) and switched from answer-comparison to operator-only assertion 
since the native path may legitimately diverge from Spark for non-UTC timezones
   - [ ] Spark 3.4 / 4.0 profile sanity
   - [ ] Benchmark UTC native path vs JVM UDF path on a representative workload
   
   ## Notes for reviewers
   
   This is a draft pending the items above and any feedback on the routing 
strategy in `CometDateFormat.convert`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] feat: JVM UDF fallback for date_format [datafusion-comet]

Reply via email to