Re: library for unicode collation in C for texi2any?

Eli Zaretskii Thu, 12 Oct 2023 08:14:28 -0700

> Date: Thu, 12 Oct 2023 15:00:57 +0200
> From: Patrice Dumas <[email protected]>
> Cc: [email protected]
> 
> On Thu, Oct 12, 2023 at 01:29:27PM +0300, Eli Zaretskii wrote:
> > What is "smart sorting"? where is it described/documented?
> 
> It is, in general, any way to sort Unicode text that takes into account
> natural-language word order.  In practice, what is used in
> Unicode::Collate is the 'Unicode Technical Standard #10' Unicode
> Collation Algorithm (a.k.a. UCA) described in
> http://www.unicode.org/reports/tr10.  In texi2any, we set an option of
> collation,
>   ( 'variable' => 'Non-Ignorable' )
> such that spaces and punctuation marks sort before letters.  This
> specific option is described in
> http://www.unicode.org/reports/tr10/#Variable_Weighting
> 
> It would be perfect if the same sorting could be obtained, but if
> C code does not follow exactly the same standard, I do not think
> that it is so problematic, as long as the sorting is sensible.  It could
> actually be problematic for tests, but if the output of texi2any is ok
> even if not fully reproducible, it would still be better than sorting
> according to the Unicode codepoint in a full C implementation.


What you say is not detailed enough, but using my crystal ball I think
you can have this with glibc-based systems, and also on Windows (but
that requires using a special API for comparing strings).  Not sure
about the equivalent features on other systems, like *BSD and macOS.
You can see that in action in how GNU 'ls' sorts file names.
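
A minimal sketch of the glibc route, assuming the standard C functions
setlocale, strcoll and qsort are enough for the job.  The helper name
sort_index_entries and the idea of passing a fixed locale name are
illustrative assumptions, not something texi2any does.  Note that strcoll
follows whatever collation data the chosen locale ships, so the result
need not match the UCA ordering produced by Unicode::Collate.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: compare two C strings using the collation rules
   of the current LC_COLLATE locale.  */
static int
compare_collated (const void *a, const void *b)
{
  const char *sa = *(const char *const *) a;
  const char *sb = *(const char *const *) b;
  return strcoll (sa, sb);
}

/* Hypothetical helper: select LOCALE_NAME for collation, then sort
   the N strings in ENTRIES in place.  Returns 0 on success, or -1 if
   the requested locale is not available on this system.  */
static int
sort_index_entries (const char **entries, size_t n,
                    const char *locale_name)
{
  if (setlocale (LC_COLLATE, locale_name) == NULL)
    return -1;
  qsort (entries, n, sizeof entries[0], compare_collated);
  return 0;
}
```

Calling sort_index_entries (entries, n, "") would pick up the user's
locale from the environment, which is what 'ls' effectively does.
Passing one fixed locale name instead keeps the order independent of the
user's environment, provided that locale is installed.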

> > In general, Unicode collation rules are locale- and
> > language-dependent.  My recommendation for Texinfo is not to use
> > locale-specific collation rules, so that the indices would come out
> > sorted identically no matter in which locale the user runs texi2any.
> 
> That's the plan.  The plan is to use the @documentlanguage information
> with Unicode::Collate::Locale in the future, but never use the locale.

I don't recommend tailoring index sorting for the language indicated
by @documentlanguage, either.

> This is still a TODO item, though, as Unicode::Collate::Locale is a perl
> core module since perl 5.14 only, released in 2011, so my plan was to
> wait for 2031 to use it and be able to assume that it is indeed present
> the same way we assume that Unicode::Collate is present.

We can have this in C today.
