On Sun, Feb 04, 2024 at 12:55:36PM +0200, Eli Zaretskii wrote:
> > Date: Sun, 4 Feb 2024 11:42:52 +0100
> > From: pertu...@free.fr
> > Cc: Gavin Smith <gavinsmith0...@gmail.com>, bug-texinfo@gnu.org
> >
> > On Fri, Feb 02, 2024 at 08:57:01AM +0200, Eli Zaretskii wrote:
> > > I think en_US.utf-8 is (or at least can be by default) a combination
> > > of @documentlanguage and @documentencoding.
> >
> > I try to make the index collation as independent as possible of
> > @documentencoding and output encoding. Here the utf-8 is meant to
> > provide a sorting 'independent' of the encoding.
>
> Why is that a good idea? Presumably, a manual whose language is
> provided by @documentlanguage is indeed written in that language, and
> so the collation should be according to that language? Or what am I
> missing?

My point above is not about @documentlanguage; it is about
@documentencoding. Regarding @documentlanguage, I agree that it could
be an interesting option.

> If we want collation which uses only codepoints, disregarding any
> collation weights defined by the Unicode TR10, we could use
> en_US.utf-8, but then, as Gavin says, using glibc collation function
> you get more than you asked, because weights are not ignored. So we
> need to use something else in the C variant of collation code, AFAIU.

Indeed, but I have no idea what to use for now.

> > Regarding the language for now the aim was to have something as
> > similar as the Perl output, which is obtained without a locale. The
> > choice of en_US was motivated by that aim. I looked at the
> > /usr/lib/locale/*/LC_COLLATE files on my debian GNU/Linux and there was
> > no "en.utf-8", which would have been my first choice, so I used
> > "en_US.utf-8".
>
> I don't know enough about what Perl does in the module you are using.

It implements Unicode TR10, and we pass an option so that Weighting is
set to Non-ignorable.

> "Obtained without a locale" means what exactly? a collation order that
> only considers the Unicode codepoints of the characters?

In that context, I mean a collation that follows Unicode TR10 without
language tailoring and, if possible, with Weighting set to
Non-ignorable.

> Or does it
> mean something else? If it only considers the codepoints, then
> collation in C using glibc functions will NOT produce the same order
> even under en_US.utf-8, AFAIU.

--
Pat
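
For illustration only, here is a minimal sketch of the difference
discussed above between pure codepoint order and collation done by the
C library under a locale. It uses Python, whose locale.strxfrm wraps
the C library's collation. The sample entries and the helper names
codepoint_sort and locale_sort are made up for this example, and
en_US.UTF-8 is assumed to be installed (locale_sort returns None if it
is not). Neither ordering is the TR10 sort with Non-ignorable weighting
that the Perl code produces.

```python
import locale


def codepoint_sort(entries):
    # Plain string comparison in Python orders by Unicode codepoint.
    return sorted(entries)


def locale_sort(entries, locale_name="en_US.UTF-8"):
    # Sort with the C library's collation for the given locale.
    # Returns None when the locale is not available on this system.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(entries, key=locale.strxfrm)
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, previous)


# Hypothetical index entries, chosen to mix case and punctuation.
entries = ["zebra", "Zebra", "apple", "Apple", "@item", "index"]

print("codepoint:", codepoint_sort(entries))
print("locale:   ", locale_sort(entries))
```

On a glibc system the two lists will typically differ, for instance in
where the uppercase entries and the entry starting with "@" end up,
which is the mismatch raised in the quoted text.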