Hi Eli.

See below.

Eli Zaretskii <[email protected]> wrote:

> > From: Gavin Smith <[email protected]>
> > Date: Sun, 19 Apr 2026 21:27:12 +0100
> > Cc: [email protected], Werner LEMBERG <[email protected]>
> > 
> > * Here's my current preferred solution, which should work with any awk (gawk
> >   or mawk) regardless of the locale setting, as well as with XeTeX and
> >   LuaTeX (which Werner Lemberg reported problems with in 2022):
> > 
> >   In texinfo.tex, output multibyte UTF-8 sequences with braces around
> >   them in the sort key.
> > 
> >   This works because texindex preserves braced units.
> > 
> > $ cat test.texi
> > \input texinfo
> > 
> > @cindex à gré, césure
> > @cindex écrire des lettres
> > @cindex bbbb
> > 
> > 
> > Index: 
> > @printindex cp
> > 
> > @bye
> > $ cat test.cp
> > @entry{{à} gr{é}, c{é}sure}{1}{à gré, césure}
> > @entry{{é}crire des lettres}{1}{écrire des lettres}
>
> Sorry, I don't understand how this could work in texindex.  The
> sorting in Awk still uses libc functions like strcoll to compare
> strings,

Actually, gawk only uses strcoll() if --posix was supplied.
Otherwise it's strncmp() or strncasecmp().  I don't know of any
awk that does use strcoll(). (Of course, that doesn't mean there
isn't one.)
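To illustrate what a plain byte-wise comparison does to the entries from
Gavin's test.texi, here is a small sketch. It is in Python rather than awk,
purely for illustration: sorting on the UTF-8 bytes stands in for a
strncmp()-style comparison, and it is not gawk's actual code.

```python
# Index entries from the test.texi example quoted above.
entries = ["à gré, césure", "écrire des lettres", "bbbb"]

def byte_key(s):
    # Compare on the raw UTF-8 bytes, as a strncmp()-style comparison would.
    # Assumption: bytes are compared as unsigned values, with no locale involved.
    return s.encode("utf-8")

byte_sorted = sorted(entries, key=byte_key)
print(byte_sorted)
```

Every ASCII byte is below 0x80 and every byte of a multibyte UTF-8 sequence
is 0x80 or above, so the accented entries land after all the plain ASCII
ones whatever the locale is.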

> The only way I know of to perform these tasks in a way that don't
> require the corresponding locale to be available is to use a library
> that can process non-ASCII text without using libc locale-dependent
> functions.  Emacs, for example, has such a "library" in its own code.
> One alternative is to use something like ICU.  For UTF-8-only
> encoding, we could use Gnulib's libunistring.

Gavin and I discussed a lot of this privately. Because texindex is
intended to work with any awk, the solution isn't in gawk itself.
One option is to force a locale, chosen from the document language and
locale, when running awk, provided `locale -a' indicates that there is a
chance for it to work.
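A minimal sketch of that availability check, again in Python purely for
illustration. The function names and the sample listing are hypothetical,
and the codeset normalization (treating "UTF-8" and "utf8" as the same) is
my assumption about how `locale -a' spells names on different systems.

```python
def normalize(name):
    # Fold case and drop '-' in the codeset part only, so that
    # 'fr_FR.UTF-8' and 'fr_FR.utf8' compare equal (assumed spelling variants).
    base, dot, codeset = name.strip().partition(".")
    return base + dot + codeset.lower().replace("-", "")

def locale_available(wanted, locale_a_output):
    # locale_a_output is the text printed by `locale -a', one name per line.
    names = {normalize(line) for line in locale_a_output.splitlines() if line.strip()}
    return normalize(wanted) in names

# Hypothetical listing, standing in for real `locale -a' output.
sample = "C\nC.utf8\nPOSIX\nfr_FR.utf8\n"
print(locale_available("fr_FR.UTF-8", sample))
print(locale_available("de_DE.UTF-8", sample))
```

Only if the check succeeds would it make sense to set the locale for the
awk run; otherwise the fallback is the locale-independent behavior.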

If I understand it correctly, wrapping each multibyte character in braces
helps because sorting is then done on all the bytes of the
multibyte-encoded letter instead of on just its first byte.
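A small sketch of why the first byte alone is not enough, using the two
accented initials from the test file. It is in Python purely for
illustration and says nothing about how texindex itself is coded.

```python
a_grave = "à".encode("utf-8")   # two bytes: 0xC3 0xA0
e_acute = "é".encode("utf-8")   # two bytes: 0xC3 0xA9

# Looking only at the first byte, the two letters are indistinguishable.
same_first_byte = a_grave[:1] == e_acute[:1]

# Looking at the whole sequence, they are distinct and have a stable order.
whole_sequence_order = a_grave < e_acute

print(same_first_byte, whole_sequence_order)
```

Both letters start with the lead byte 0xC3, so only the full two-byte
sequence tells them apart.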

I'll let Gavin address the advantages/disadvantages of other options.

Thanks,

Arnold
