On 7 May 2012 04:10, Brian DeRocher <[email protected]> wrote:
> I disagree that make_standard_string() needs to work for all major
> languages. I mean it does, but you should know the language ahead of time.
> Since the problem of standardizing words is language related, can you first
> use the HTTP header Accept-Language to pick the language (or use geoip), and
> then standardize according to rules of that language?
I think you may be slightly misinterpreting the purpose of make_standard_string. It is not so much an attempt to clean up the string as an attempt to generate a standard, simplified search token with as high a chance as possible of including the required search result and as few collisions as possible. It probably makes more sense to think of it as a specialised Metaphone-type algorithm.

The fact that it produces such poor tokens for the cases given is an issue. We really do need to do something about that, since we currently get too large a search space for short tokens. It may be as simple as adding a length constraint for some of the rules - although the overlap between US state abbreviations and words for 'the' in other languages is annoying!

If you are able to find a way to improve the token generator (given these constraints), it would obviously be welcome!

--
Brian

_______________________________________________
Geocoding mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/geocoding
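[For illustration of the length-constraint idea discussed above: the following is a minimal sketch, not Nominatim's actual make_standard_string. The word lists and the rule "only drop an article when other words remain" are assumptions chosen to show why short tokens like 'la' or 'de' are ambiguous between articles and US state abbreviations.]

```python
# Hypothetical token simplifier: drop common articles, but keep them when
# the phrase would otherwise lose its only word, so a one-word query such
# as "LA" (Louisiana?) survives intact while "La Rochelle" is simplified.
ARTICLES = {"the", "la", "le", "de", "der", "die", "das"}

def make_token(phrase):
    words = phrase.lower().split()
    kept = []
    for w in words:
        # Length-style constraint: only apply the drop-article rule when
        # the phrase has more than one word, limiting collisions with
        # US state abbreviations ("la" -> Louisiana, "de" -> Delaware).
        if w in ARTICLES and len(words) > 1:
            continue
        kept.append(w)
    # Fall back to the lower-cased input if every word was dropped.
    return " ".join(kept) or phrase.lower()
```

With this rule, "La Rochelle" simplifies to "rochelle", but the bare query "LA" is left as "la" rather than being dropped entirely; the ambiguity the email mentions is exactly the case where the single remaining word is also an article in some language.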

