Hello group, I'm writing a fuzzy search extension. The current code is getting a little messy and I'm not completely satisfied with the way it works, so I'm about to rewrite it from scratch on stronger foundations.
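To make the ligature question I raise below concrete, here is a small Python sketch. It is not the extension's code, only an illustration under a few assumptions: the function names `osa_distance` and `expand` are made up for this post, the distance is the "optimal string alignment" variant of Damerau-Levenshtein (adjacent transpositions count as one edit), and NFKC normalization stands in for "expanding" a ligature, although NFKC also folds many other compatibility characters.

```python
import unicodedata


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions and transpositions of two
    adjacent code points, each as one edit.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "ab" <-> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def expand(s: str) -> str:
    """Assumed expansion step: NFKC maps U+FB01 (ligature fi) to 'f' + 'i'."""
    return unicodedata.normalize("NFKC", s)


LIG_FI = "\ufb01"  # LATIN SMALL LIGATURE FI

# Left alone, the ligature does not match the plain letters: 2 edits.
print(osa_distance(LIG_FI + "n", "fin"))          # 2
# Expanded, it matches exactly.
print(osa_distance(expand(LIG_FI + "n"), "fin"))  # 0
# Left alone, ligature vs 'g' is a single substitution.
print(osa_distance(LIG_FI + "n", "gn"))           # 1
# Expanded, the same typo now costs two edits.
print(osa_distance(expand(LIG_FI + "n"), "gn"))   # 2
```

The last two lines show the trade-off: expanding fixes the comparison against 'f' 'i' but doubles the cost of a mismatch on the ligature.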
The goal is to provide a fuzzy search on _short_ fields like names, street address, city or similar typical user input that has to be consolidated against an existing base. I'm in no way aiming at searching for a fuzzy substring inside a large text. The current underlying algorithm would be terrible for that (I use a Damerau-Levenshtein distance for counting typos).

I need to deal with code points that would expand to several individual characters. Examples are ligatures or fractions. I've never seen ligatures used in French, nor in any European language, when it comes to user input. I believe such ligatures are more of a typesetting or word-processing finesse, which is beyond most users' care or knowledge. But if I ever encounter some, how should I deal with them?

If I leave them alone, then for instance the ligature 'fi' would not compare equal to the letter sequence 'f' 'i'. If I expand them, then the ligature 'fi' would become 'f' 'i', but if the corresponding char in the second string is 'g' then it would count as two errors instead of one. Things could get worse in more "exotic" (to me) scripts.

The same questions arise for the upper, lower, ... functions.

Again, the goal is certainly not to duplicate ICU with all its complexity but to offer a decent code base that achieves mostly correct results in as many languages as possible, keeping in mind that the memory footprint and the code overhead should be kept reasonable.

Feel free to give advice on how to deal with those issues.

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users