On 8/27/2013 6:50 AM, Jan Slodicka wrote:

That's not all that unusual: even in English, you might want to sort
Muenster and Münster next to each other.

Thanks, Igor. Do you know more? Do you consider ascii comparison too
dangerous?

At one point, our project did the same thing you are trying now: check whether both strings are pure ASCII and, if so, compare them the fast way (equivalent to memcmp, though we didn't call it; we did the checking and the comparison together, in one pass); otherwise, fall back to the OS-provided locale-sensitive comparison.
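
To illustrate the one-pass idea, here is a minimal sketch in Python. It is not the code from the project described above. The names ascii_fast_compare and compare are hypothetical, and locale.strcoll merely stands in for whatever locale-sensitive comparison the OS provides. The fast path scans both strings to the end, so a non-ASCII character that appears after the first difference still forces the fallback.

```python
import locale
from typing import Callable, Optional


def ascii_fast_compare(a: str, b: str) -> Optional[int]:
    """Check for pure ASCII and compare in a single pass.

    Returns -1, 0 or 1 (code point order, like memcmp) when both strings
    are pure ASCII, or None as soon as a non-ASCII character is seen.
    """
    result = 0
    for i in range(max(len(a), len(b))):
        # -1 marks "past the end", so a proper prefix sorts first.
        ca = ord(a[i]) if i < len(a) else -1
        cb = ord(b[i]) if i < len(b) else -1
        if ca > 127 or cb > 127:
            return None  # not pure ASCII: the caller must fall back
        if result == 0 and ca != cb:
            result = -1 if ca < cb else 1  # remember the first difference
    return result


def compare(a: str, b: str,
            fallback: Callable[[str, str], int] = locale.strcoll) -> int:
    """Use the ASCII fast path when possible, else the locale-aware fallback."""
    fast = ascii_fast_compare(a, b)
    return fast if fast is not None else fallback(a, b)
```

A function like compare could be passed to functools.cmp_to_key for sorting. Note that the fast path is only as correct as plain ASCII ordering is for the locale in question.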

In the end, we discovered ICU: it manages to be much faster than the OS comparisons (not exactly surprising), and even slightly faster than our hand-written check-and-compare-ASCII loop, while being correct for all locales. Ours is a desktop application, not resource constrained, so bundling ICU with it was not a problem.

Here's the summary of all the cases I know of where simple ASCII comparison does the wrong thing (which doesn't mean there aren't others I don't know of):

- Contractions in various Eastern European languages that use the Latin script (like Hungarian), which you are already aware of.

- Several contractions in Welsh:
http://en.wikipedia.org/wiki/Welsh_language#Orthography

- German phonebook sort, which puts AE between A and B, OE between O and P, and UE between U and V. German defines two sorts, called "dictionary" and "phonebook", which differ only in whether these contractions are used. On Windows, the user can configure which sort to use.

- Spanish traditional sort (as opposed to modern sort) puts CH between C and D, and LL between L and M. It is no longer used for anything but academic linguistic studies and can be safely ignored.

- Finnish treats W as a variant of V (it's considered a secondary distinction, like that between A and Á).

- Lithuanian puts Y between I and J
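
To make the contraction problem concrete, here is a small Python sketch that uses the Spanish traditional sort from the list above. All of the strings involved are pure ASCII, yet plain code point order gives a different result. The function traditional_spanish_key and its tables are hypothetical and deliberately simplified: only unaccented letters a-z are handled, and anything else is ranked last. A real collation library deals with far more than this.

```python
# Simplified traditional Spanish alphabet: "ch" follows "c", "ll" follows "l".
ALPHABET = list("abc") + ["ch"] + list("defghijkl") + ["ll"] + list("mnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def traditional_spanish_key(word: str) -> list:
    """Build a sort key that treats "ch" and "ll" as single letters."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])  # contraction: two characters, one letter
            i += 2
        else:
            # Characters outside a-z are ranked after everything else.
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return key


words = ["dama", "chico", "cosa", "llama", "luz", "mano"]

# Plain code point order, which is what a raw ASCII comparison yields.
print(sorted(words))
# Traditional order: "cosa" before "chico", "luz" before "llama".
print(sorted(words, key=traditional_spanish_key))
```

The same approach (map each string to a sequence of collation weights, then compare the sequences) is roughly what a collation library does internally, with many more levels and rules.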

--
Igor Tandetnik
