Dennis Sweeney <[email protected]> added the comment:
Some research of other projects:
LLVM [1][2]
-----------
- Compute Levenshtein
- Using O(n) memory rather than O(n^2)
- Uses UpperBound = (len(typo) + 2) // 3
GCC [3]
-------
- Uses Damerau-Levenshtein distance
- Counts transpositions like "abcd" <-> "bacd" as one move
- Swapping Case as in "a" <-> "A" counts as half a move
- cutoff = (longer + 2) // 3 if longer - shorter >= 2 else max(longer // 3, 1)
Rust [4]
--------
- "maximum allowable edit distance defaults to one-third of the given word."
- First checks for exact case-insensitive match, then check for Levenshtein
distance small enough, then check if sorted(a.split("_")) ==
sorted(b.split("_"))
Ruby [5]
--------
- Quickly filter out words with bad Jaro–Winkler distance
- threshold = input.length > 3 ? 0.834 : 0.77
- Only compute Levenshtein for words that remain
- threshold = (input.length * 0.25).ceil
- Output all good enough words
- If no word was good enough then output the closest match.
I think there are some good ideas here.
[1]
https://github.com/llvm/llvm-project/blob/d480f968ad8b56d3ee4a6b6df5532d485b0ad01e/llvm/include/llvm/ADT/edit_distance.h#L42
[2]
https://github.com/llvm/llvm-project/blob/e2b3b89bf1ce74bf889897e0353a3e3fa93e4452/clang/lib/Sema/SemaLookup.cpp#L4263
[3]
https://github.com/gcc-mirror/gcc/blob/16e2427f50c208dfe07d07f18009969502c25dc8/gcc/spellcheck.c
[4]
https://github.com/rust-lang/rust/blob/673d0db5e393e9c64897005b470bfeb6d5aec61b/compiler/rustc_span/src/lev_distance.rs#L44
[5]
https://github.com/ruby/ruby/blob/48b94b791997881929c739c64f95ac30f3fd0bb9/lib/did_you_mean/spell_checker.rb
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue38530>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com