andygrove opened a new issue, #4310:
URL: https://github.com/apache/datafusion-comet/issues/4310
## What is the problem the feature request solves?
Comet has two regex engines (Rust DataFusion regex and a JVM-backed UDF
engine), and the current config model exposes two orthogonal axes for selecting
which one runs:
- `spark.comet.exec.regexp.engine` ∈ `{rust, java}` — picks the engine.
- `spark.comet.expression.<Expr>.allowIncompatible=true` (per regex
expression) and `spark.comet.expression.regexp.allowIncompatible=true`
(pattern-level) — opt in to running known-incompatible cases.
The axes interact non-orthogonally and the result is confusing:
- The Rust regex engine **can never be fully Spark/Java-regex compatible**
because of fundamental differences in regex semantics (`\d`, `\b`,
backreferences, lookaround, embedded flags, etc.). So opting into the Rust
engine is, in effect, opting into incompatibilities.
- Despite that, users currently still have to set per-expression
`allowIncompatible` flags on top of choosing `rust` to actually run
incompatible cases natively. The two opt-ins encode the same intent twice.
- Some regex expressions (`regexp_extract`, `regexp_extract_all`,
`regexp_instr`) have no Rust implementation at all. Today `engine=rust` causes
those to fall back to Spark, regardless of any `allowIncompatible` setting.
That's a third, implicit behavior layered on top of the two configs.
## Describe the potential solution
Treat the engine choice as the regex-wide incompatibility opt-in. Concretely:
1. Drop the per-expression `*.allowIncompatible=true` flags for regex
expressions (they become redundant).
2. Extend the existing `SupportLevel` ADT so an `Incompatible` result can
name the config that already opted the user in:
```scala
case class Incompatible(
notes: Option[String] = None,
optedInBy: Option[String] = None
) extends SupportLevel
```
3. The gating logic in `QueryPlanSerde.exprToProtoInternal` treats
`Incompatible(_, Some(key))` as opted-in when that config is set, in addition
to the existing per-expression `allowIncompatible` check.
4. Regex serdes read the engine config inside `getSupportLevel`:
```scala
// RegExpReplace, StringSplit, RLike, etc.: rust supported but
incompatible
override def getSupportLevel(expr: RegExpReplace): SupportLevel =
CometConf.regexpEngine.get match {
case "java" => Compatible(None)
case "rust" => Incompatible(
Some("Rust regex engine differs from Java in subtle ways"),
optedInBy = Some("spark.comet.exec.regexp.engine=rust"))
}
// RegExpExtract, RegExpExtractAll, RegExpInStr: java-only
override def getSupportLevel(expr: RegExpExtract): SupportLevel =
CometConf.regexpEngine.get match {
case "java" => Compatible(None)
case "rust" => Unsupported(Some(
"Only supported via the java engine; set
spark.comet.exec.regexp.engine=java"))
}
```
After this change the model collapses to a single knob:
- `engine=java` (default): every regex expression is fully Spark-compatible
and runs natively via the JVM UDF bridge (with a JNI cost per batch).
- `engine=rust`: regex expressions that have a Rust implementation run
natively, with known semantic differences. Expressions without a Rust
implementation fall back per the discussion point below. No per-expression
flags involved.
### Trade-offs
- `getSupportLevel` becomes context-dependent (reads the engine config).
That's a small philosophical shift: today's serdes don't read global config in
`getSupportLevel`. The alternative is passing a context object through
`getSupportLevel`, which would touch every existing serde. The cost of reading
`CometConf` from a few regex serdes seems worth avoiding that churn.
- Removes per-expression granularity for regex. A user who wants Rust for
`regexp_replace` but Java for `split` no longer has a way to express that. I
don't think that's a real use case, but worth flagging.
## Discussion point: fallback path when the chosen engine doesn't support an
expression
When the user picks `engine=rust` and runs `regexp_extract` (java-only),
there are three plausible behaviors:
1. **Fall back to Spark** (current behavior). The expression is tagged with
a reason, the planner declines to wrap it in `CometNativeExec`, and Spark
executes that operator. Pro: matches the user's explicit choice — they asked
for Rust and we don't silently swap in something else. Con: breaks the native
pipeline, losing native execution for the rest of the stage.
2. **Fall back to Comet's Java regex engine** (JVM UDF). The expression
still runs natively, just via the JVM UDF path instead of Rust. Pro: preserves
the native pipeline, likely better end-to-end perf for queries that mix
supported and unsupported regex. Con: the user said `engine=rust` and silently
got a Java JNI call. Surprising; harder to reason about perf.
3. **Add a third value `engine=rust-then-java`** (or `engine=auto`) that
explicitly opts into "Rust where possible, Java otherwise." Pure `engine=rust`
keeps option (1) semantics. Pro: makes the intent explicit and observable;
users who pick `rust-then-java` know they are accepting JNI for the unsupported
cases. Con: three engine values instead of two.
My current lean is **option 3** — it preserves "what you set is what you
get" for the binary `rust`/`java` cases and gives the hybrid path a name. But
I'd like input from others before committing.
Related: do we want the same fallback question to apply to *patterns* the
Rust engine can parse but doesn't handle correctly (the current
`regexp.allowIncompatible` territory)? Today those run incompatibly when the
user opts in; under option 3 we could route them through Java instead.
## Additional context
- This is shaped by #4239 (extending the JVM UDF framework to cover all
regex expressions), where the dual-axis model first became visible.
- Implementation note: the existing per-expression `allowIncompatible` flags
for non-regex expressions (e.g. cast) are unaffected; this change only touches
the regex family.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]