Re: library for unicode collation in C for texi2any?

Patrice Dumas Thu, 12 Oct 2023 06:01:25 -0700

On Thu, Oct 12, 2023 at 01:29:27PM +0300, Eli Zaretskii wrote:
> > Date: Thu, 12 Oct 2023 11:39:14 +0200
> > From: Patrice Dumas <[email protected]>
> > 
> > One thing I could not find easily in C is something to replace the
> > Unicode::Collate perl module for index entries sorting using 'smart'
> > rules for sorting, that could be either found in Gnulib, included easily
> > in the Texinfo distribution or would be, in general, installed.  Unless
> > I missed something, there is no such facility in libunistring, it seems
> > to be in libICU, but I do not know how easy it could be
> > integrated/shipped with Texinfo and I do not think that it is installed
> > in the general case.
> > 
> > 
> > Do you have information, on how to do 'smart' unicode sorting in
> > C, including for tests, which could allow shipping of code as we already
> > do with libunistring in gnulib in case it is not already installed, such
> > that it is used in the general case?  Could also be example of projects
> > that have managed to do that.
> 
> What is "smart sorting"? where is it described/documented?


It is, in general, any way of sorting Unicode text that takes the word
order of natural languages into account.  In practice, Unicode::Collate
implements the Unicode Collation Algorithm (UCA) defined in
'Unicode Technical Standard #10', described in
http://www.unicode.org/reports/tr10.  In texi2any, we set the collation
option
  ( 'variable' => 'Non-Ignorable' )
such that spaces and punctuation marks sort before letters.  This
specific option is described in
http://www.unicode.org/reports/tr10/#Variable_Weighting
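
To make the effect of that option concrete, here is a toy model in
Python.  It is not the UCA and not what Unicode::Collate does
internally: the weight table, the function names and the sample entries
are all made up for illustration.  It only shows the idea that, with
'Non-Ignorable', spaces and punctuation keep a small primary weight and
therefore sort before letters, whereas ignoring them changes the order.

```python
# Toy primary weights (an assumption for illustration, not real UCA data):
# "variable" characters get small non-zero weights, below every letter.
VARIABLE = {" ": 1, "-": 2}

def primary(ch):
    """Return a made-up primary weight for a lowercase ASCII character."""
    if ch in VARIABLE:
        return VARIABLE[ch]
    return 100 + ord(ch) - ord("a")

def key_non_ignorable(s):
    # Variable characters take part in the comparison like any other.
    return tuple(primary(ch) for ch in s)

def key_blanked(s):
    # Variable characters are dropped before comparing.
    return tuple(primary(ch) for ch in s if ch not in VARIABLE)

entries = ["deluge", "de luge", "death", "de-luge", "demark"]
non_ignorable = sorted(entries, key=key_non_ignorable)
blanked = sorted(entries, key=key_blanked)

print(non_ignorable)  # ['de luge', 'de-luge', 'death', 'deluge', 'demark']
print(blanked)        # ['death', 'deluge', 'de luge', 'de-luge', 'demark']
```

In the first list the entries containing a space or a hyphen come
before "death"; in the second they are grouped with "deluge".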

It would be perfect if the same sorting could be obtained, but if the
C code does not follow exactly the same standard, I do not think it
would be a big problem, as long as the sorting is sensible.  It could
be a problem for tests, but even if the output of texi2any is not fully
reproducible, a sensible order would still be better than sorting by
Unicode code point in a full C implementation.
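
As a small illustration of why plain code point order is a poor
fallback, the Python sketch below compares it with a crude folded key
(case-folded, combining marks removed).  The folded key is only an
assumption about what a "sensible" approximation might look like; it is
neither the UCA nor anything texi2any implements, and the sample words
are invented.

```python
import unicodedata

def folded_key(s):
    """Crude approximation: case-fold and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["apple", "Zebra", "\u00e9clair", "zoo"]

# Python compares str values by code point, like a naive C comparison
# of decoded characters would.
by_codepoint = sorted(words)
by_folded = sorted(words, key=folded_key)

print(by_codepoint)  # ['Zebra', 'apple', 'zoo', 'éclair']
print(by_folded)     # ['apple', 'éclair', 'Zebra', 'zoo']
```

With code point order, the uppercase entry sorts first and the accented
one last, which is not what a reader of an index would expect.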

> In general, Unicode collation rules are locale- and
> language-dependent.  My recommendation for Texinfo is not to use
> locale-specific collation rules, so that the indices would come out
> sorted identically no matter in which locale the user runs texi2any.

That's the plan.  In the future, the idea is to use the
@documentlanguage information with Unicode::Collate::Locale, but never
the user's locale.  This is still a TODO item, though, as
Unicode::Collate::Locale has been a Perl core module only since
Perl 5.14, released in 2011, so my plan was to wait until 2031 to use
it and be able to assume that it is indeed present, the same way we
assume that Unicode::Collate is present.

-- 
Pat
