Re: library for unicode collation in C for texi2any?

Gavin Smith Sat, 14 Oct 2023 02:07:52 -0700

On Fri, Oct 13, 2023 at 07:31:29AM +0000, Werner LEMBERG wrote:
> 
> >> OK, no tailoring.  I wasn't aware of those differences, thanks for
> >> pointing me to it.
> >> 
> >> Hopefully, we agree that `@documentlanguage` should set a
> >> language-specific collation for the index.
> > 
> > Without tailoring, this basically means collation according to
> > Unicode codepoints.
> 
> Uh oh, this is not good.  As an example, consider the letter 'ä'.
> There are two possible collations that are considered as correct for
> German:
> 
> * Sort 'ä' right before 'b'.
> 
> * Handle 'ä' similar to 'ae' but sort it after 'ae'.
> 
> Neither collation corresponds to Unicode codepoints.


I think there is some confusion here.  The Unicode Collation Algorithm
does not simply order by codepoint.  The Unicode codepoint for 'ä' (U+00E4)
is not compared numerically to that for 'a' (U+0061) at any point.

See https://www.unicode.org/reports/tr10/#Collation_And_Code_Chart_Order.

> The basic principle to remember is: The position of characters in the
> Unicode code charts does not specify their sort order.

As far as I understand, there is a default ordering where ä will be sorted
after a.  A "multilevel" ordering is used: accents are only taken into
account when the base letters compare equal, giving the following ordering:

a
ä
ab
äb
z

rather than

a
ab
ä
äb
z

which is what would happen if ä were simply treated as a letter between
a and b.
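
To make the difference concrete, here is a rough sketch in Python.  It is
not the Unicode Collation Algorithm and does not use the DUCET weights; it
is a toy two-level key (the name toy_multilevel_key is made up for this
example) that assumes Latin letters with combining accents only.  Base
letters are compared first, and accents only break ties.

```python
import unicodedata

def toy_multilevel_key(word):
    """Toy two-level sort key: base letters first, accents second.

    This is only an illustration of the "multilevel" idea, not the real
    Unicode Collation Algorithm.
    """
    primary = []    # base letters, compared first
    secondary = []  # accents attached to each base letter, compared second
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # Attach the combining mark to the preceding base letter;
            # a mark with no preceding base letter is ignored.
            if secondary:
                secondary[-1] += ch
        else:
            primary.append(ch.lower())
            secondary.append("")
    return (tuple(primary), tuple(secondary))

# "\u00e4" is 'ä'; escapes are used so the file encoding does not matter.
words = ["z", "\u00e4b", "\u00e4", "ab", "a"]

multilevel = sorted(words, key=toy_multilevel_key)
codepoint = sorted(words)  # plain comparison by codepoint

print(multilevel)
print(codepoint)
```

With this key, "ä" and "a" share the same first-level key, so "ä" lands
directly after "a" and before "ab".  The plain codepoint sort instead puts
both "ä" and "äb" after "z", since U+00E4 is greater than U+007A.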

"Tailoring" is a further language-dependent alteration to the collation
algorithm.  The TR10 document gives the example of Swedish where 'ä' would be
its own letter and sort after 'z':

a
ab
z
ä
äb
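
As a sketch of what such a tailoring amounts to, the toy Python key below
takes an optional list of letters that are treated as letters in their own
right and placed after 'z'.  The function name and the extra_letters
parameter are invented for this illustration, base letters are assumed to
be plain ASCII, and a real implementation would tailor the collation
element table (for example from CLDR data) rather than use a hand-written
list.

```python
import unicodedata

def toy_tailored_key(word, extra_letters=()):
    """Toy two-level key with an optional, hypothetical "tailoring".

    Letters listed in extra_letters get their own first-level weight
    after 'z'; all other accented letters sort with their base letter.
    """
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", word):
        if ch in extra_letters:
            # Tailored letter: a distinct letter, ordered after a-z.
            primary.append((1, extra_letters.index(ch)))
            secondary.append("")
        else:
            # Untailored: split into base letter plus any accents.
            base, *marks = unicodedata.normalize("NFD", ch)
            primary.append((0, ord(base.lower())))
            secondary.append("".join(marks))
    return (tuple(primary), tuple(secondary))

# "\u00e4" is 'ä'.
words = ["\u00e4b", "z", "a", "\u00e4", "ab"]

default_order = sorted(words, key=toy_tailored_key)
swedish_like = sorted(
    words, key=lambda w: toy_tailored_key(w, extra_letters=("\u00e4",))
)

print(default_order)
print(swedish_like)
```

Without the extra letter the result is the default multilevel order
(a, ä, ab, äb, z); with 'ä' tailored it becomes the Swedish-style order
shown above.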

Nobody is arguing for "codepoint-order" sorting.  What is in question
here is whether there should be this latter language-dependent alteration
(tailoring) of the sorting order.  This alteration may be good in theory,
but it remains to be seen how practical it is to achieve.
