On 27/06/2009 7:00 AM, Jean-Christophe Deschamps wrote: > At 13:25 26/06/2009, you wrote: > ´¯¯¯ >> I am trying to find words in a dictionary stored in sqlite, and trying >> a near miss approach. >> For that I tried an algorithm to create patterns corresponding to >> Levenshtein distance of 1 (edit distance of 1). >> That means, one adition, one remotion or one substitution. >> >> Any hint on how to speed up this thing? > `--- > > Hi, > > I'm currently finishing an C extension offering, among other functions, > a "TYPOS" scalar operator which is meant to perform just that, and a > bit more.
There's a strong presumption that it doesn't handle CJK text, but what about alphabets other than Latin-based e.g. Arabic, Cyrillic, Greek, Hebrew, ...? > Internally, it applies a Unicode fold() function, What does fold() do? Strip off accents/umlauts/etc? > a Unicode lower() upper() might be more suitable; consider the German eszett (U+00DF). > function and then computes the Damerau-Levenshtein distance between the > strings. It returns the number of insertions, omissions, change and > transposition (of adjacent letters only). Consider an additional API which returns a scaled similarity score e.g 1.0 - float(distance) / max(length(string1), length(string2)) > If the reference string is 'abcdef', it will return 1 (one typo) for > 'abdef' missing c > 'abcudef' u inserted > 'abzef' c changed into z > 'abdcef' c & d exchanged > > It will also accept a trailing '%' in string2 acting as in LIKE. > > You can use it this way: > > select * from t where typos(col, 'levencht%') <= 2; > > or this way > > select typos(str1, str2) > > The code currently makes use of a couple of Win32 functions, which > should have Un*x equivalent. It runs at really decent speed even if I > didn't fight for optimization. It will obviously outperform any SQL > solution by a large factor. Does it use the icu library? What is the memory footprint? Cheers, John _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users