puranjay2597 opened a new pull request, #60799:
URL: https://github.com/apache/doris/pull/60799
Adds 5 new built-in scalar functions for fuzzy string matching and
similarity scoring, useful for record deduplication, search ranking, and data
quality workflows:
| Function | Return type | Description |
|---|---|---|
| `levenshtein(str1, str2)` | INT | |
| `damerau_levenshtein(str1, str2)` | INT | True Damerau-Levenshtein
distance (unrestricted transpositions) |
| `jaro_winkler(str1, str2)` | DOUBLE | Jaro-Winkler similarity [0.0, 1.0] |
| `cosine_similarity(str1, str2)` | DOUBLE | Cosine similarity on
byte-frequency vectors [0.0, 1.0] |
| `jaccard_similarity(str1, str2)` | DOUBLE | Jaccard similarity on byte
bigrams [0.0, 1.0] |
All functions accept `VARCHAR` and `STRING` inputs and propagate NULL.
### Release note
Add 5 built-in string similarity/distance functions: `levenshtein`,
`damerau_levenshtein`, `jaro_winkler`, `cosine_similarity`,
`jaccard_similarity`.
### Check List (For Author)
- Test
- [x] Manual test (add detailed scripts or steps below)
```sql
-- Levenshtein
SELECT levenshtein('kitten', 'sitting'); -- 3
SELECT levenshtein('', 'abc'); -- 3
-- Damerau-Levenshtein (true, not OSA)
SELECT damerau_levenshtein('ca', 'abc'); -- 2 (transposition
ca->ac + insert b)
SELECT damerau_levenshtein('kitten', 'sitting'); -- 3
-- Jaro-Winkler
SELECT jaro_winkler('MARTHA', 'MARHTA'); -- ~0.9611
SELECT jaro_winkler('abc', 'abc'); -- 1.0
-- Cosine similarity
SELECT cosine_similarity('hello', 'hello'); -- 1.0
SELECT cosine_similarity('hello', 'world'); -- ~0.516
-- Jaccard similarity
SELECT jaccard_similarity('night', 'nacht'); -- ~0.222
SELECT jaccard_similarity('abc', 'abc'); -- 1.0
```
### Performance
Benchmarked against equivalent Java UDFs on Apache Doris 4.0.2
**Test setup:** 100M rows × 96-char strings (md5-generated), `SELECT
count(func(a, b))` to isolate compute from I/O.
| Rows | C++ builtin | Java UDF | Speedup |
|---|---|---|---|
| 1M | 1.5s | 6.5s | 4.3x* |
| 100M | 2m 20s | 4m 25s | ~1.9x |
| 1B | 19m | 36m | ~1.89x |
| 400M production rows (repeated query) | 0.11s | 96s | ~870x |
\* 1M speedup is JVM-warmup-inflated; 100M and 1B represent fully
JIT-compiled steady state.
Key observations:
- **Consistent ~2x speedup** at scale (100M and 1B confirm the JVM reaches
full JIT compilation and the advantage stabilises).
- **Result cache**: C++ builtins benefit from Doris SQL-layer result caching
keyed on query text + data version. Java UDFs bypass the result cache entirely
and always recompute — explaining the ~870x gap on repeated queries.
- **Stack allocation**: `levenshtein`, `jaro_winkler`, and
`jaccard_similarity` use stack-allocated arrays for strings ≤512 bytes,
eliminating heap allocation for the common case.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]