On 17/06/2009 1:19 AM, Christophe Leske wrote: >>> So far , so good, but my client also expects ANY simplification of a >>> character to be recognized: >>> Cote d'azur for instance should return "Côte d'azur" >>> or the Sao Paulo issue - how can a search for "Sao Paulo" return "Sào >>> Paulo" in the result set? >>> >> How are these examples different from previous ones? >> > I am sorry, but I find this to be quite obvious? > Here, the problematic char is to be found in the *result set*, not in > the query itself.
It's NOT different. You need to map BOTH your database values and your query values into the same space and then compare them. Don't fall into the trap of assuming that your database is correctly accented. > > How do you educate SQlite to return me "Sào Paulo" if only "Sao Paulo" > is being entered? > How do I know which character to substitute with a placeholder? > > Is it > S%o Paulo to look for? > Or Sa% Paulo? > Or Sao P%ulo? > > I can't know this beforehand. These are just examples, i need a generic > solution if possivble. > > All i can see so far is to build a table of all special characters ever > used in the 24000 names of cities which make problems and remap them > accordingly. That's exactly what you need. And you're not the first person with this problem. See for example http://mail.python.org/pipermail/python-list/2008-July/669592.html The technique discussed there starts off with using the unicodedata database and finding dynamically (and caching) Unicode characters that can be decomposed into a basic latin letter plus one or more accents, backed up by a table of cases not found by that technique. Great for likers of clever code who have lots of CPU and disk space (unicodeddata is huge!) to spare. I have developed a table which maps most latin-decorated Unicode characters into the non-decorated basic form. Sometimes 2 ASCII characters will be produced (e.g. latin capital letter thorn -> "Th") but latin small letter u with diaeresis -> "u" -- not "ue" which is German-specific. I can let you have a copy if you are interested. What is your implementation language? C/C++? BTW someone mentioned smashing everything into lowercase for comparison purposes at some stage -- I'd suggest uppercase especially if you have a few of the good old eszett in your data :-) BTW2: The only sane usage of soundex IMHO is as a strawman when proposing phonetic matching algorithms like NYSIIS and [Double ]Metaphone :-) HTH, John _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users