On 27/06/2009 7:00 AM, Jean-Christophe Deschamps wrote:
> At 13:25 26/06/2009, you wrote:
> ´¯¯¯
>> I am trying to find words in a dictionary stored in sqlite, and trying
>> a near miss approach.
>> For that I tried an algorithm to create patterns corresponding to
>> Levenshtein distance of 1 (edit distance of 1).
>> That means, one adition, one remotion or one substitution.
>>
>> Any hint on how to speed up this thing?
> `---
> 
> Hi,
> 
> I'm currently finishing an C extension offering, among other functions, 
> a "TYPOS" scalar operator which is meant to perform just that, and a 
> bit more.

There's a strong presumption that it doesn't handle CJK text, but what 
about alphabets other than Latin-based e.g. Arabic, Cyrillic, Greek, 
Hebrew, ...?

> Internally, it applies a Unicode fold() function,

What does fold() do? Strip off accents/umlauts/etc?

> a Unicode lower() 

upper() might be more suitable; consider the German eszett (U+00DF).

> function and then computes the Damerau-Levenshtein distance between the 
> strings.  It returns the number of insertions, omissions, change and 
> transposition (of adjacent letters only).

Consider an additional API which returns a scaled similarity score e.g 
1.0 - float(distance) / max(length(string1), length(string2))

> If the reference string is 'abcdef', it will return 1 (one typo) for
> 'abdef'     missing c
> 'abcudef'   u inserted
> 'abzef'     c changed into z
> 'abdcef'    c & d exchanged
> 
> It will also accept a trailing '%' in string2 acting as in LIKE.
> 
> You can use it this way:
> 
>    select * from t where typos(col, 'levencht%') <= 2;
> 
> or this way
> 
>    select typos(str1, str2)
> 
> The code currently makes use of a couple of Win32 functions, which 
> should have Un*x equivalent.  It runs at really decent speed even if I 
> didn't fight for optimization.  It will obviously outperform any SQL 
> solution by a large factor.

Does it use the icu library? What is the memory footprint?

Cheers,
John
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to