Dandandan opened a new pull request, #22354:
URL: https://github.com/apache/datafusion/pull/22354
## Which issue does this PR close?
<!-- No issue yet; happy to file one if preferred. -->
- Closes #.
## Rationale for this change
`OptimizedRegex` (introduced for ClickBench Q28-style patterns) already
handles `^...(capture).*$` → `${1}` by stripping the trailing `.*$` and running
the shortened regex with reusable `CaptureLocations`. That avoids `expand()`
and string allocation, but still pays for full regex matching on every row.
For the common subset where the regex up to the capture reduces to a
**finite set of literal byte prefixes** and the capture has the form `([^X]+)X`
for a single ASCII byte `X`, we can do strictly better: parse the pattern's HIR
once at planning time, enumerate the literal prefix variants, and dispatch each
row to a `memchr`-based extractor — no regex engine involvement per row.
Patterns that fit this shape:
| Pattern | Prefix variants | Terminator |
|---|---|---|
| `^https?://(?:www\.)?([^/]+)/.*$` (ClickBench Q28) | `https://www.`,
`http://www.`, `https://`, `http://` | `/` |
| `^foo:([^,]+),.*$` | `foo:` | `,` |
| `^(?:foo\|bar\|baz):([^/]+)/.*$` | `bar:`, `baz:`, `foo:` | `/` |
Any pattern that doesn't fit (case-insensitive, multiline, non-ASCII
terminator, unbounded prefix) falls back to the existing `ShortenedRegex` path
with no change.
## Performance
Measured on ClickBench Q28 (`REGEXP_REPLACE(\"Referer\",
'^https?://(?:www\.)?([^/]+)/.*\$', '\1')`, partitioned dataset, `dfbench
--iterations 5 --query 28`, same machine):
| Build | Avg ms | Delta |
|---|---:|---:|
| Upstream main (shortened-regex only) | 4577.52 | baseline |
| Literal-prefix fast path | 2225.00 | **-51.4%** |
Per-iteration:
- Upstream: 4294.1, 4426.0, 4431.4, 5035.1, 4701.0
- Literal-prefix: 2417.5, 2178.1, 2244.0, 2154.8, 2130.6
## What changes are included in this PR?
- New `LiteralPrefixCaptureSpec` variant of the internal `ShortRegex` enum,
holding a deduplicated longest-first list of literal prefixes plus a
single-byte terminator.
- `try_recognize_literal_prefix_capture(pattern)` parses the pattern with
`regex-syntax::parse`, walks the HIR, enumerates prefix variants (bounded by
`MAX_PREFIX_VARIANTS = 32`), and verifies the capture is greedy `[^X]+`
followed by a literal `X`, `.*`, `\$`.
- Runtime extractor probes prefixes longest-first; on empty-capture failure
(the longest prefix consumes bytes the capture needs) it falls back to shorter
prefixes. That preserves the full regex's backtracking semantics for cases like
`http://www./path` against the Q28 pattern, where the regex prefers to leave
`www.` outside the optional so the capture is non-empty.
- New transitive direct dep: `regex-syntax` (already pulled in via `regex`,
version pinned to `0.8`). Wired through the `regex_expressions` feature so it's
only compiled when `regex` is.
## Are these changes tested?
New unit tests in `datafusion-functions::regex::regexpreplace::tests`:
- `literal_prefix_recognizer_accepts_clickbench_q28` — exact prefix list and
terminator for the Q28 pattern.
- `literal_prefix_recognizer_accepts_single_literal` and
`_accepts_alternation` — basic shapes.
- `literal_prefix_recognizer_rejects_non_anchored`,
`_rejects_unbounded_prefix`, `_rejects_non_ascii_terminator`,
`_rejects_case_insensitive` — guardrails on what the recognizer will and won't
accept.
- `literal_prefix_fast_path_matches_full_regex_for_q28_pattern` and
`_for_alternation_pattern` — differential tests that run the optimized path and
the full `regex::Regex` on the same inputs (including edge cases like
`http://www./path`, embedded `\n`, empty captures, and non-matching inputs) and
assert byte-equal output.
```
cargo test -p datafusion-functions --lib regex::regexpreplace
cargo clippy -p datafusion-functions --all-targets --all-features -- -D
warnings
```
All 22 tests in this module pass (13 pre-existing + 9 new).
## Are there any user-facing changes?
No. `regexp_replace` semantics are unchanged — any input that wouldn't go
through the fast path follows the exact same code as before, and inputs that do
go through the fast path are differentially tested against the full regex.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]