On 8/26/2013 1:26 PM, _ph_ wrote:
> Should "AD" + "ZV" really compare as "A" + the "DZ" digraph + "V" in the
> respective language? I am not sure about the intended behavior, but it seems
> strange. (OTOH, language. It's always strange.)

In Hungarian, yes, that's what happens.

> Anyway, I would definitely Unicode-normalize the strings *before* putting
> them into the database. You might avoid the special handling for the
> digraphs if you normalize /towards/ the digraph code points: only strings
> actually containing digraphs would escape your optimization.

There are no separate code points one could normalize to. These languages use normal ASCII letters, but sorting is more complex than letter-by-letter comparison. That's not all that unusual: even in English, you might want to sort Muenster and Münster next to each other.

By the way, the correct name for such sequences is not "digraphs", but "contractions": http://www.unicode.org/reports/tr10/#Contractions
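
To make the contraction idea concrete, here is a minimal sketch in Python. It is not the UCA algorithm and not anything SQLite provides: the alphabet list, the greedy longest-match rule, and the names HU_ALPHABET, collation_elements and sort_key are all assumptions made up for illustration, and accented vowels are left out to keep it short.

```python
# Simplified Hungarian letter order (assumption: accented vowels omitted).
# Multi-character entries are the contractions: they sort as single letters.
HU_ALPHABET = [
    "a", "b", "c", "cs", "d", "dz", "dzs", "e", "f", "g", "gy", "h", "i",
    "j", "k", "l", "ly", "m", "n", "ny", "o", "p", "q", "r", "s", "sz",
    "t", "ty", "u", "v", "w", "x", "y", "z", "zs",
]

# Primary weight of each letter is simply its position in the alphabet.
WEIGHT = {letter: i for i, letter in enumerate(HU_ALPHABET)}
MAX_LEN = max(len(letter) for letter in HU_ALPHABET)


def collation_elements(text):
    """Split text into letters, preferring the longest known contraction.

    Greedy matching is an assumption of this sketch, not a full rule set.
    """
    s = text.lower()
    elements = []
    i = 0
    while i < len(s):
        for n in range(MAX_LEN, 0, -1):
            chunk = s[i:i + n]
            if chunk in WEIGHT:
                elements.append(chunk)
                i += len(chunk)
                break
        else:
            # Unknown character: keep it as its own element.
            elements.append(s[i])
            i += 1
    return elements


def sort_key(text):
    """Map text to a list of weights; unknown characters sort last."""
    unknown = len(HU_ALPHABET)
    return [WEIGHT.get(e, unknown) for e in collation_elements(text)]
```

With this sketch, collation_elements("AD" + "ZV") yields ["a", "dz", "v"], and sorted(["csak", "cukor"], key=sort_key) puts "cukor" first because "c" sorts before "cs", whereas a plain code-point sort puts "csak" first. Note that every letter involved is plain ASCII; only the comparison changes.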
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
