Dandandan opened a new pull request, #22354:
URL: https://github.com/apache/datafusion/pull/22354

   ## Which issue does this PR close?
   
   <!-- No issue yet; happy to file one if preferred. -->
   
   - Closes #.
   
   ## Rationale for this change
   
   `OptimizedRegex` (introduced for ClickBench Q28-style patterns) already 
handles `^...(capture).*$` → `${1}` by stripping the trailing `.*$` and running 
the shortened regex with reusable `CaptureLocations`. That avoids `expand()` 
and string allocation, but still pays for full regex matching on every row.
   
   For the common subset where the regex up to the capture reduces to a 
**finite set of literal byte prefixes** and the capture has the form `([^X]+)X` 
for a single ASCII byte `X`, we can do strictly better: parse the pattern's HIR 
once at planning time, enumerate the literal prefix variants, and dispatch each 
row to a `memchr`-based extractor — no regex engine involvement per row.
   
   Patterns that fit this shape:
   
   | Pattern | Prefix variants | Terminator |
   |---|---|---|
   | `^https?://(?:www\.)?([^/]+)/.*$` (ClickBench Q28) | `https://www.`, 
`http://www.`, `https://`, `http://` | `/` |
   | `^foo:([^,]+),.*$` | `foo:` | `,` |
   | `^(?:foo\|bar\|baz):([^/]+)/.*$` | `bar:`, `baz:`, `foo:` | `/` |
   
   Any pattern that doesn't fit (case-insensitive, multiline, non-ASCII 
terminator, unbounded prefix) falls back to the existing `ShortenedRegex` path 
with no change.
   
   ## Performance
   
   Measured on ClickBench Q28 (`REGEXP_REPLACE(\"Referer\", 
'^https?://(?:www\.)?([^/]+)/.*\$', '\1')`, partitioned dataset, `dfbench 
--iterations 5 --query 28`, same machine):
   
   | Build | Avg ms | Delta |
   |---|---:|---:|
   | Upstream main (shortened-regex only) | 4577.52 | baseline |
   | Literal-prefix fast path | 2225.00 | **-51.4%** |
   
   Per-iteration:
   
   - Upstream:        4294.1, 4426.0, 4431.4, 5035.1, 4701.0
   - Literal-prefix:  2417.5, 2178.1, 2244.0, 2154.8, 2130.6
   
   ## What changes are included in this PR?
   
   - New `LiteralPrefixCaptureSpec` variant of the internal `ShortRegex` enum, 
holding a deduplicated longest-first list of literal prefixes plus a 
single-byte terminator.
   - `try_recognize_literal_prefix_capture(pattern)` parses the pattern with 
`regex-syntax::parse`, walks the HIR, enumerates prefix variants (bounded by 
`MAX_PREFIX_VARIANTS = 32`), and verifies the capture is greedy `[^X]+` 
followed by a literal `X`, `.*`, `\$`.
   - Runtime extractor probes prefixes longest-first; on empty-capture failure 
(the longest prefix consumes bytes the capture needs) it falls back to shorter 
prefixes. That preserves the full regex's backtracking semantics for cases like 
`http://www./path` against the Q28 pattern, where the regex prefers to leave 
`www.` outside the optional so the capture is non-empty.
   - New transitive direct dep: `regex-syntax` (already pulled in via `regex`, 
version pinned to `0.8`). Wired through the `regex_expressions` feature so it's 
only compiled when `regex` is.
   
   ## Are these changes tested?
   
   New unit tests in `datafusion-functions::regex::regexpreplace::tests`:
   
   - `literal_prefix_recognizer_accepts_clickbench_q28` — exact prefix list and 
terminator for the Q28 pattern.
   - `literal_prefix_recognizer_accepts_single_literal` and 
`_accepts_alternation` — basic shapes.
   - `literal_prefix_recognizer_rejects_non_anchored`, 
`_rejects_unbounded_prefix`, `_rejects_non_ascii_terminator`, 
`_rejects_case_insensitive` — guardrails on what the recognizer will and won't 
accept.
   - `literal_prefix_fast_path_matches_full_regex_for_q28_pattern` and 
`_for_alternation_pattern` — differential tests that run the optimized path and 
the full `regex::Regex` on the same inputs (including edge cases like 
`http://www./path`, embedded `\n`, empty captures, and non-matching inputs) and 
assert byte-equal output.
   
   ```
   cargo test -p datafusion-functions --lib regex::regexpreplace
   cargo clippy -p datafusion-functions --all-targets --all-features -- -D 
warnings
   ```
   
   All 22 tests in this module pass (13 pre-existing + 9 new).
   
   ## Are there any user-facing changes?
   
   No. `regexp_replace` semantics are unchanged — any input that wouldn't go 
through the fast path follows the exact same code as before, and inputs that do 
go through the fast path are differentially tested against the full regex.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to