On 8/26/2013 1:26 PM, _ph_ wrote:
> Should "AD" + "ZV" really compare as "A" + the "DZ" digraph + "V" in the respective language? I am not sure about the intended behavior, but it seems strange. (OTOH, language. It's always strange.)
In Hungarian, yes, that's what happens.
> Anyway, I would definitely Unicode-normalize the strings *before* putting them into the database. You might avoid the special handling for the digraphs if you normalize /towards/ the digraph code points: only strings actually containing digraphs would escape your optimization.
There are no separate code points one could normalize to. These languages use normal ASCII letters, but sorting is more complex than letter-by-letter comparison. That's not all that unusual: even in English, you might want to sort Muenster and Münster next to each other.
By the way, the correct name for such sequences is not "digraphs", but "contractions": http://www.unicode.org/reports/tr10/#Contractions
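To make the "AD" + "ZV" case concrete, here is a minimal Python sketch of a contraction-aware sort key. Everything in it is an assumption for illustration: the names HU_ALPHABET, RANK and sort_key are made up, the alphabet is a simplified Hungarian one with accented vowels left out, and the greedy longest-match rule is a simplification (it can misread letter pairs that merely meet at a word-part boundary). A real collation should come from a proper UCA implementation such as ICU.

```python
# Simplified Hungarian alphabet (illustrative assumption): the multi-letter
# units are listed where they sort, accented vowels are omitted for brevity.
HU_ALPHABET = [
    "a", "b", "c", "cs", "d", "dz", "dzs", "e", "f", "g", "gy", "h", "i",
    "j", "k", "l", "ly", "m", "n", "ny", "o", "p", "q", "r", "s", "sz",
    "t", "ty", "u", "v", "w", "x", "y", "z", "zs",
]
RANK = {unit: i for i, unit in enumerate(HU_ALPHABET)}
MAX_LEN = max(len(unit) for unit in HU_ALPHABET)


def sort_key(text):
    """Return a tuple of ranks, matching the longest known unit at each position."""
    s = text.lower()
    key = []
    i = 0
    while i < len(s):
        # Try the longest candidate first so "dzs" wins over "dz" and "d".
        for n in range(MAX_LEN, 0, -1):
            unit = s[i:i + n]
            if unit in RANK:
                key.append(RANK[unit])
                i += len(unit)
                break
        else:
            # Unknown character: sort it after every known unit.
            key.append(len(HU_ALPHABET) + ord(s[i]))
            i += 1
    return tuple(key)


# Concatenation changes the key: "d" + "z" fuse into the single unit "dz".
print(sort_key("ad") + sort_key("zv"))
print(sort_key("adzv"))

# "cs" sorts after every plain "c", unlike a byte-wise comparison.
print(sorted(["cukor", "csak", "cica"], key=sort_key))
```

The first two printed keys differ, which is why a collation key cannot be built by sorting or caching the two halves of a string separately and gluing the results together.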
-- Igor Tandetnik