At 13:25 26/06/2009, you wrote:
´¯¯¯
>I am trying to find words in a dictionary stored in sqlite, and trying
>a near miss approach.
>For that I tried an algorithm to create patterns corresponding to
>Levenshtein distance of 1 (edit distance of 1).
>That means, one adition, one remotion or one substitution.
>
>Any hint on how to speed up this thing?
`---

Hi,

I'm currently finishing an C extension offering, among other functions, 
a "TYPOS" scalar operator which is meant to perform just that, and a 
bit more.

Internally, it applies a Unicode fold() function, a Unicode lower() 
function and then computes the Damerau-Levenshtein distance between the 
strings.  It returns the number of insertions, omissions, change and 
transposition (of adjacent letters only).

If the reference string is 'abcdef', it will return 1 (one typo) for
'abdef'     missing c
'abcudef'   u inserted
'abzef'     c changed into z
'abdcef'    c & d exchanged

It will also accept a trailing '%' in string2 acting as in LIKE.

You can use it this way:

   select * from t where typos(col, 'levencht%') <= 2;

or this way

   select typos(str1, str2)

The code currently makes use of a couple of Win32 functions, which 
should have Un*x equivalent.  It runs at really decent speed even if I 
didn't fight for optimization.  It will obviously outperform any SQL 
solution by a large factor.

I can't promise a very clean version tomorrow but just mail if you're 
interested in the C source. You could tailor it to your precise needs 
easily.

Hope it can help.


_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to