At 13:25 26/06/2009, you wrote: ´¯¯¯ >I am trying to find words in a dictionary stored in sqlite, and trying >a near miss approach. >For that I tried an algorithm to create patterns corresponding to >Levenshtein distance of 1 (edit distance of 1). >That means, one adition, one remotion or one substitution. > >Any hint on how to speed up this thing? `---
Hi, I'm currently finishing an C extension offering, among other functions, a "TYPOS" scalar operator which is meant to perform just that, and a bit more. Internally, it applies a Unicode fold() function, a Unicode lower() function and then computes the Damerau-Levenshtein distance between the strings. It returns the number of insertions, omissions, change and transposition (of adjacent letters only). If the reference string is 'abcdef', it will return 1 (one typo) for 'abdef' missing c 'abcudef' u inserted 'abzef' c changed into z 'abdcef' c & d exchanged It will also accept a trailing '%' in string2 acting as in LIKE. You can use it this way: select * from t where typos(col, 'levencht%') <= 2; or this way select typos(str1, str2) The code currently makes use of a couple of Win32 functions, which should have Un*x equivalent. It runs at really decent speed even if I didn't fight for optimization. It will obviously outperform any SQL solution by a large factor. I can't promise a very clean version tomorrow but just mail if you're interested in the C source. You could tailor it to your precise needs easily. Hope it can help. _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users