andygrove opened a new issue, #4310:
URL: https://github.com/apache/datafusion-comet/issues/4310

   ## What is the problem the feature request solves?
   
   Comet has two regex engines (Rust DataFusion regex and a JVM-backed UDF 
engine), and the current config model exposes two orthogonal axes for selecting 
which one runs:
   
   - `spark.comet.exec.regexp.engine` ∈ `{rust, java}` — picks the engine.
   - `spark.comet.expression.<Expr>.allowIncompatible=true` (per regex 
expression) and `spark.comet.expression.regexp.allowIncompatible=true` 
(pattern-level) — opt in to running known-incompatible cases.
   
   The axes interact non-orthogonally and the result is confusing:
   
   - The Rust regex engine **can never be fully Spark/Java-regex compatible** 
because of fundamental differences in regex semantics (`\d`, `\b`, 
backreferences, lookaround, embedded flags, etc.). So opting into the Rust 
engine is, in effect, opting into incompatibilities.
   - Despite that, users currently still have to set per-expression 
`allowIncompatible` flags on top of choosing `rust` to actually run 
incompatible cases natively. The two opt-ins encode the same intent twice.
   - Some regex expressions (`regexp_extract`, `regexp_extract_all`, 
`regexp_instr`) have no Rust implementation at all. Today `engine=rust` causes 
those to fall back to Spark, regardless of any `allowIncompatible` setting. 
That's a third, implicit behavior layered on top of the two configs.
   
   ## Describe the potential solution
   
   Treat the engine choice as the regex-wide incompatibility opt-in. Concretely:
   
   1. Drop the per-expression `*.allowIncompatible=true` flags for regex 
expressions (they become redundant).
   2. Extend the existing `SupportLevel` ADT so an `Incompatible` result can 
name the config that already opted the user in:
   
      ```scala
      case class Incompatible(
          notes: Option[String] = None,
          optedInBy: Option[String] = None
      ) extends SupportLevel
      ```
   
   3. The gating logic in `QueryPlanSerde.exprToProtoInternal` treats 
`Incompatible(_, Some(key))` as opted-in when that config is set, in addition 
to the existing per-expression `allowIncompatible` check.
   
   4. Regex serdes read the engine config inside `getSupportLevel`:
   
      ```scala
      // RegExpReplace, StringSplit, RLike, etc.: rust supported but 
incompatible
      override def getSupportLevel(expr: RegExpReplace): SupportLevel =
        CometConf.regexpEngine.get match {
          case "java" => Compatible(None)
          case "rust" => Incompatible(
            Some("Rust regex engine differs from Java in subtle ways"),
            optedInBy = Some("spark.comet.exec.regexp.engine=rust"))
        }
   
      // RegExpExtract, RegExpExtractAll, RegExpInStr: java-only
      override def getSupportLevel(expr: RegExpExtract): SupportLevel =
        CometConf.regexpEngine.get match {
          case "java" => Compatible(None)
          case "rust" => Unsupported(Some(
            "Only supported via the java engine; set 
spark.comet.exec.regexp.engine=java"))
        }
      ```
   
   After this change the model collapses to a single knob:
   
   - `engine=java` (default): every regex expression is fully Spark-compatible 
and runs natively via the JVM UDF bridge (with a JNI cost per batch).
   - `engine=rust`: regex expressions that have a Rust implementation run 
natively, with known semantic differences. Expressions without a Rust 
implementation fall back per the discussion point below. No per-expression 
flags involved.
   
   ### Trade-offs
   
   - `getSupportLevel` becomes context-dependent (reads the engine config). 
That's a small philosophical shift: today's serdes don't read global config in 
`getSupportLevel`. The alternative is passing a context object through 
`getSupportLevel`, which would touch every existing serde. The cost of reading 
`CometConf` from a few regex serdes seems worth avoiding that churn.
   - Removes per-expression granularity for regex. A user who wants Rust for 
`regexp_replace` but Java for `split` no longer has a way to express that. I 
don't think that's a real use case, but worth flagging.
   
   ## Discussion point: fallback path when the chosen engine doesn't support an 
expression
   
   When the user picks `engine=rust` and runs `regexp_extract` (java-only), 
there are three plausible behaviors:
   
   1. **Fall back to Spark** (current behavior). The expression is tagged with 
a reason, the planner declines to wrap it in `CometNativeExec`, and Spark 
executes that operator. Pro: matches the user's explicit choice — they asked 
for Rust and we don't silently swap in something else. Con: breaks the native 
pipeline, losing native execution for the rest of the stage.
   
   2. **Fall back to Comet's Java regex engine** (JVM UDF). The expression 
still runs natively, just via the JVM UDF path instead of Rust. Pro: preserves 
the native pipeline, likely better end-to-end perf for queries that mix 
supported and unsupported regex. Con: the user said `engine=rust` and silently 
got a Java JNI call. Surprising; harder to reason about perf.
   
   3. **Add a third value `engine=rust-then-java`** (or `engine=auto`) that 
explicitly opts into "Rust where possible, Java otherwise." Pure `engine=rust` 
keeps option (1) semantics. Pro: makes the intent explicit and observable; 
users who pick `rust-then-java` know they are accepting JNI for the unsupported 
cases. Con: three engine values instead of two.
   
   My current lean is **option 3** — it preserves "what you set is what you 
get" for the binary `rust`/`java` cases and gives the hybrid path a name. But 
I'd like input from others before committing.
   
   Related: do we want the same fallback question to apply to *patterns* the 
Rust engine can parse but doesn't handle correctly (the current 
`regexp.allowIncompatible` territory)? Today those run incompatibly when the 
user opts in; under option 3 we could route them through Java instead.
   
   ## Additional context
   
   - This is shaped by #4239 (extending the JVM UDF framework to cover all 
regex expressions), where the dual-axis model first became visible.
   - Implementation note: the existing per-expression `allowIncompatible` flags 
for non-regex expressions (e.g. cast) are unaffected; this change only touches 
the regex family.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to