On 8/27/2013 6:50 AM, Jan Slodicka wrote:
>> That's not all that unusual: even in English, you might want to sort
>> Muenster and Münster next to each other.
>
> Thanks, Igor. Do you know more? Do you consider ascii comparison too
> dangerous?

At one point, our project did the same thing you are trying now: check
whether both strings are pure ASCII and, if so, compare them the fast way
(equivalent to memcmp, though we didn't call it; we did the checking and
the comparison together, in one pass); otherwise fall back to the
OS-provided locale-sensitive comparison.
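
For illustration only, a one-pass check-and-compare might look roughly like
the sketch below. This is not the code from our project: the function name
compare_ascii_or_locale is made up, and strcoll() merely stands in for
whatever locale-sensitive comparison the OS provides. Note that the fast
path is a plain byte comparison, so it is case-sensitive.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: compares two NUL-terminated strings.
 * If both are pure ASCII, the result is the byte-wise order (like memcmp).
 * As soon as a non-ASCII byte shows up in either string, the whole
 * comparison is handed to strcoll() instead. */
static int compare_ascii_or_locale(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;
    int result = 0; /* sign of the first byte difference, 0 if none yet */

    assert(a != NULL && b != NULL);

    /* Walk both strings together until the shorter one ends. */
    while (*p != '\0' && *q != '\0') {
        if ((*p | *q) & 0x80)
            return strcoll(a, b);        /* non-ASCII: fall back */
        if (result == 0 && *p != *q)
            result = (*p < *q) ? -1 : 1; /* remember, but keep scanning */
        p++;
        q++;
    }

    /* No difference in the common prefix: the longer string sorts later. */
    if (result == 0)
        result = (*p != '\0') - (*q != '\0');

    /* The unread tail must be ASCII too (at most one loop does any work). */
    for (; *p != '\0'; p++)
        if (*p & 0x80)
            return strcoll(a, b);
    for (; *q != '\0'; q++)
        if (*q & 0x80)
            return strcoll(a, b);

    return result;
}
```

The scan continues past the first difference because the fast answer is
only valid if both strings turn out to be ASCII all the way to the end.
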
In the end, we discovered ICU: it manages to be much faster than the OS
comparisons (not exactly surprising), and even slightly faster than our
hand-written check-and-compare-ASCII loop, while being correct for all
locales. Ours is a desktop application, not resource constrained, so
bundling ICU with it was not a problem.
Here's the summary of all the cases I know of where simple ASCII
comparison does the wrong thing (which doesn't mean there aren't others
I don't know of):
- Contractions in various Latin-script-using Eastern-European languages
(like Hungarian) you are already aware of.
- Several contractions in Welsh:
http://en.wikipedia.org/wiki/Welsh_language#Orthography
- German phonebook sort, that puts AE between A and B, OE between O and
P, and UE between U and V. German defines two sorts, called "dictionary"
and "phonebook", which differ only in whether these contractions are
used. On Windows, the user can configure which sort to use.
- Spanish traditional sort (as opposed to modern sort) puts CH between C
and D, and LL between L and M. It is no longer used for anything but
academic linguistic studies and can be safely ignored.
- Finnish treats W as a variant of V (it's considered a secondary
distinction, like that between A and Á).
- Lithuanian puts Y between I and J.
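
To make the point concrete: several of the cases above involve strings that
are pure ASCII, so an "is it ASCII?" check does not catch them. The sketch
below only shows what byte comparison does; the sample words are my own
picks, and the "expected" orders in the comments simply apply the rules
listed above at face value. The helper name bytewise_before is made up.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: nonzero if a sorts before b under plain byte
 * comparison, which is what an ASCII fast path would use. */
static int bytewise_before(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) < 0;
}

/* Pure-ASCII pairs where the byte order disagrees with the rules above:
 *
 *   "chico" vs "cosa"  byte order: chico first ('h' < 'o').
 *                      Spanish traditional sort: CH comes after all of C,
 *                      so cosa would come first.
 *
 *   "jis" vs "yra"     byte order: jis first ('j' < 'y').
 *                      Lithuanian: Y sits between I and J,
 *                      so yra would come first.
 */
```
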
--
Igor Tandetnik