On Thu, Oct 12, 2023 at 01:29:27PM +0300, Eli Zaretskii wrote:
> > Date: Thu, 12 Oct 2023 11:39:14 +0200
> > From: Patrice Dumas <pertu...@free.fr>
> >
> > One thing I could not find easily in C is something to replace the
> > Unicode::Collate perl module for index entries sorting using 'smart'
> > rules for sorting, that could be either found in Gnulib, included easily
> > in the Texinfo distribution or would be, in general, installed.  Unless
> > I missed something, there is no such facility in libunistring, it seems
> > to be in libICU, but I do not know how easy it could be
> > integrated/shipped with Texinfo and I do not think that it is installed
> > in the general case.
> >
> >
> > Do you have information, on how to do 'smart' unicode sorting in
> > C, including for tests, which could allow shipping of code as we already
> > do with libunistring in gnulib in case it is not already installed, such
> > that it is used in the general case?  Could also be example of projects
> > that have managed to do that.
>
> What is "smart sorting"?  where is it described/documented?

It is, in general, any way of sorting Unicode strings that takes into
account the ordering conventions of natural languages.  In practice, what
Unicode::Collate implements is the Unicode Collation Algorithm (a.k.a.
UCA) from 'Unicode Technical Standard #10', described in
http://www.unicode.org/reports/tr10

In texi2any, we set a collation option, ( 'variable' => 'Non-Ignorable' ),
such that spaces and punctuation marks sort before letters.  This
specific option is described in
http://www.unicode.org/reports/tr10/#Variable_Weighting

It would be perfect if the same sorting could be obtained, but if the C
code does not follow exactly the same standard, I do not think that it
is much of a problem, as long as the sorting is sensible.  It could
actually be problematic for tests, but if the output of texi2any is ok
even if not fully reproducible, it would still be better than having a
full C implementation that sorts according to the Unicode codepoint.

> In general, Unicode collation rules are locale- and
> language-dependent.  My recommendation for Texinfo is not to use
> locale-specific collation rules, so that the indices would come out
> sorted identically no matter in which locale the user runs texi2any.

That's the plan.  The plan is to use the @documentlanguage information
with Unicode::Collate::Locale in the future, but never use the locale.
This is still a TODO item, though, as Unicode::Collate::Locale has been
a perl core module only since perl 5.14, released in 2011, so my plan
was to wait for 2031 to use it and be able to assume that it is indeed
present the same way we assume that Unicode::Collate is present.

-- 
Pat