Hi Eli. See below.

Eli Zaretskii <[email protected]> wrote:

> > From: Gavin Smith <[email protected]>
> > Date: Sun, 19 Apr 2026 21:27:12 +0100
> > Cc: [email protected], Werner LEMBERG <[email protected]>
> >
> > * Here's my current preferred solution, which should work with any awk (gawk
> >   or mawk) regardless of the locale setting, as well as with XeTeX and
> >   LuaTeX (which Werner Lemberg reported problems with in 2022):
> >
> >   In texinfo.tex, output multibyte UTF-8 sequences with braces around
> >   them in the sort key.
> >
> >   This works because texindex preserves braced units.
> >
> > $ cat test.texi
> > \input texinfo
> >
> > @cindex à gré, césure
> > @cindex écrire des lettres
> > @cindex bbbb
> >
> >
> > Index:
> > @printindex cp
> >
> > @bye
> > $ cat test.cp
> > @entry{{à} gr{é}, c{é}sure}{1}{à gré, césure}
> > @entry{{é}crire des lettres}{1}{écrire des lettres}
>
> Sorry, I don't understand how this could work in texindex.  The
> sorting in Awk still uses libc functions like strcoll to compare
> strings,

Actually, gawk only uses strcoll() if --posix was supplied. Otherwise
it's strncmp() or strncasecmp(). I don't know of any awk that
does use strcoll(). (Of course, that doesn't mean there isn't one.)

> The only way I know of to perform these tasks in a way that don't
> require the corresponding locale to be available is to use a library
> that can process non-ASCII text without using libc locale-dependent
> functions.  Emacs, for example, has such a "library" in its own code.
> One alternative is to use something like ICU.  For UTF-8-only
> encoding, we could use Gnulib's libunistring.

Gavin and I discussed a lot of this privately. Because texindex is
intended to work with any awk, the solution isn't in gawk itself.
Trying to force a given locale when running awk based on the document
language and locale is one option, if `locale -a' indicates that
there is a chance for it to work.

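For illustration only, here is a rough Python sketch of the two ideas in
play: the brace-wrapped sort keys from Gavin's test.cp quoted above, and
a locale-independent, byte-wise comparison of the kind strncmp() gives.
This is not what texinfo.tex or texindex actually run; the function
names brace_multibyte and bytewise_sorted are made up for the example,
and it assumes the input is precomposed (NFC) UTF-8 text.

```python
# Illustrative sketch only; not code from texinfo.tex, texindex, or gawk.
# Assumes precomposed (NFC) input, so each accented letter is one character.


def brace_multibyte(text):
    """Wrap every character that needs more than one UTF-8 byte in braces."""
    return "".join(
        "{" + ch + "}" if len(ch.encode("utf-8")) > 1 else ch
        for ch in text
    )


def bytewise_sorted(entries):
    """Sort by raw UTF-8 bytes, independent of any locale setting."""
    return sorted(entries, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    entries = ["écrire des lettres", "bbbb", "à gré, césure"]
    for entry in entries:
        print(brace_multibyte(entry))
    print(bytewise_sorted(entries))
```

Note that a pure byte-wise ordering places every non-ASCII character
after all ASCII characters, because the bytes of a multibyte UTF-8
sequence are all 0x80 or above. That is stable across locales, but it is
not a linguistic collation order.
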
If I understand it, wrapping things in braces helps, in that sorting
will then be done on all the bytes in a multibyte-encoded letter instead
of on just the first byte.

I'll let Gavin address the advantages/disadvantages of other options.

Thanks,

Arnold
