On Fri, Oct 13, 2023 at 07:31:29AM +0000, Werner LEMBERG wrote:
>
> >> OK, no tailoring.  I wasn't aware of those differences, thanks for
> >> pointing me to it.
> >>
> >> Hopefully, we agree that `@documentlanguage` should set a
> >> language-specific collation for the index.
> >
> > Without tailoring, this basically means collation according to
> > Unicode codepoints.
>
> Uh oh, this is not good.  As an example, consider the letter 'ä'.
> There are two possible collations that are considered as correct for
> German:
>
> * Sort 'ä' right before 'b'.
>
> * Handle 'ä' similar to 'ae' but sort it after 'ae'.
>
> Neither collation corresponds to Unicode codepoints.

I think there is some confusion here.  The Unicode Collation Algorithm
does not simply order by codepoint.  So the Unicode codepoint for 'ä'
(U+00E4) is not compared numerically to that for 'a' (U+0061) at any
point.

See https://www.unicode.org/reports/tr10/#Collation_And_Code_Chart_Order.

> The basic principle to remember is: The position of characters in the
> Unicode code charts does not specify their sort order.

As far as I understand, there is a default ordering where ä will be
sorted after a.  A "multilevel" ordering is used, giving the following
ordering

  a
  ä
  ab
  äb
  z

rather than

  a
  ab
  ä
  äb
  z

which is what would happen if ä were simply treated as a letter between
a and b.

"Tailoring" is a further language-dependent alteration to the collation
algorithm.  The TR10 document gives the example of Swedish, where 'ä'
would be its own letter and sort after 'z':

  a
  ab
  z
  ä
  äb

Nobody is arguing for "codepoint-order" sorting.  What is in question
here is whether there should be this latter language-dependent
alteration of the sorting order.  This alteration may be good in theory,
but it remains to be seen how practical it is to achieve.
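
To make the three orderings above concrete, here is a rough Python
sketch.  It is not the Unicode Collation Algorithm and uses no real
collation tables: multilevel_key, alphabet_key and the two alphabet
strings are made-up names for illustration only.  The first key compares
base letters first and falls back to accents only on a tie; the second
ranks each character by its position in a given alphabet, which is a
crude stand-in for tailoring.

```python
import unicodedata

def multilevel_key(word):
    """Toy two-level sort key: base letters first, accents as tie-breaker."""
    primary = []    # base letters with accents stripped
    secondary = []  # combining marks attached to each base letter
    # NFD splits 'ä' into 'a' followed by U+0308 (combining diaeresis).
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if secondary:  # a mark with no preceding base letter is dropped
                secondary[-1] += ch
        else:
            primary.append(ch)
            secondary.append("")
    # Tuples compare element by element, so the secondary level is only
    # consulted when the primary strings are equal.
    return ("".join(primary), tuple(secondary))

def alphabet_key(word, alphabet):
    """Toy 'tailored' key: each character ranks by its position in alphabet."""
    # str.index raises ValueError for characters missing from the alphabet.
    return [alphabet.index(ch) for ch in unicodedata.normalize("NFC", word)]

# Hypothetical alphabets for illustration, lowercase only.
BETWEEN_A_AND_B = "aäbcdefghijklmnopqrstuvwxyz"
SWEDISH_LIKE = "abcdefghijklmnopqrstuvwxyzåäö"

words = ["z", "äb", "ab", "ä", "a"]

# Multilevel: prints "a ä ab äb z"
print(" ".join(sorted(words, key=multilevel_key)))
# 'ä' as a letter between a and b: prints "a ab ä äb z"
print(" ".join(sorted(words, key=lambda w: alphabet_key(w, BETWEEN_A_AND_B))))
# 'ä' as a letter after z: prints "a ab z ä äb"
print(" ".join(sorted(words, key=lambda w: alphabet_key(w, SWEDISH_LIKE))))
```

The only difference between the last two calls is the alphabet string,
which is roughly the kind of per-language data that tailoring would have
to supply.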