On Mon, Apr 20, 2026 at 09:43:04AM -0600, [email protected] wrote:
> Eli Zaretskii <[email protected]> wrote:
> 
> > > From: [email protected]
> > > Date: Mon, 20 Apr 2026 07:43:16 -0600
> > > Cc: [email protected], [email protected], [email protected], 
> > > [email protected]
> > > 
> > > If I understand it, wrapping things in braces helps, in that sorting
> > > will then be done on all the bytes in a multibyte-encoded letter
> > > instead of on just the first byte.
> >
> > OK, but that still leaves the question of whether the byte sequence
> > corresponding to {à} sorts before or after {É}, say.  And Gawk
> > determines that by calling locale-dependent libc functions, doesn't
> > it?  Or did you assume LC_ALL=C, which will cause Gawk to work with
> > individual bytes?  (And if so, what do other Awks do in that case?)
> 
> Hmmm... First, let's restrict this to gawk. Most other awks don't
> deal in wide characters.
> 
> Gawk only converts to wide strings internally when needed, like for
> length(), index(), and so on.  It looks like, if not ignoring case,
> for == and !=, strcmp() is used on the multibyte strings. Otherwise
> (<, <= etc), memcmp().
> 
> > IOW, why did you say strncmp and not wcscmp?
> 
> wcscmp() isn't used at all in gawk.
> 
> This may be a bug, but maybe not.

AFAIK strcmp on UTF-8 encoded strings will return a value with the same
sign as wcscmp called on an equivalent wide character string, so there is
likely no bug here in gawk.  (This is assuming that the possible character
values are represented in codepoint order in wchar_t, e.g. as an integer
giving the codepoint value.)
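
To illustrate the claim above, here is a minimal sketch in Python (not
gawk or C).  It assumes that comparing the UTF-8 encoded bytes stands in
for strcmp/memcmp, and that comparing lists of code points stands in for
wcscmp with wchar_t holding code points.  The function names and sample
strings are made up for the illustration.

```python
def bytewise_cmp(a, b):
    """Sign of an unsigned byte-by-byte comparison of the UTF-8 encodings
    (stand-in for strcmp/memcmp on multibyte strings)."""
    ea, eb = a.encode("utf-8"), b.encode("utf-8")
    return (ea > eb) - (ea < eb)


def codepoint_cmp(a, b):
    """Sign of a comparison by code point value
    (stand-in for wcscmp, assuming wchar_t holds code points)."""
    ca, cb = [ord(c) for c in a], [ord(c) for c in b]
    return (ca > cb) - (ca < cb)


# Arbitrary sample strings: ASCII, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["a", "z", "ab", "à", "É", "é", "{à}", "{É}", "€", "\U0001F600"]

# Collect every pair where the two comparisons disagree in sign.
mismatches = [
    (a, b)
    for a in samples
    for b in samples
    if bytewise_cmp(a, b) != codepoint_cmp(a, b)
]

print(mismatches)  # → []
```

This only checks a handful of strings, so it is a sanity check rather
than a proof, and it says nothing about platforms where wchar_t is not a
plain code point.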

As for comparison of "{à}" and "{É}", gawk/texindex would likely
compare them by Unicode codepoint value, so "{É}" (U+00C9) would sort
before "{à}" (U+00E0).
This would need testing to see how good the results are.
It's possible that changes to texindex would be needed to
strip out these braces.
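
As a rough idea of what codepoint-order sorting would give, here is a
small Python sketch.  It does not run texindex or gawk; it just sorts
some made-up index entries by their UTF-8 bytes, which is an assumption
about how the braced entries would end up being compared.

```python
# Hypothetical index entries; the braced forms imitate "{à}" and "{É}".
entries = ["zebra", "{à} propos", "{É}cole", "apple", "ecole"]


def byte_key(entry):
    """Sort key: the raw UTF-8 bytes, compared as unsigned values."""
    return entry.encode("utf-8")


by_bytes = sorted(entries, key=byte_key)

print(by_bytes)
# → ['apple', 'ecole', 'zebra', '{É}cole', '{à} propos']
```

In this sketch the accented entries land after every entry that starts
with an ASCII letter, and "É" comes before "à", which is not what a
French reader would expect from an index.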

I've got a dim idea that we could also add something to the translation
files, like txi-fr.tex, for language-specific tailoring of sort order.

I'm hopeful we can get fairly good results with changes to texinfo.tex
output and possibly minor adjustments in texindex, while allowing texindex
to run in any locale and processing either by bytes or by UTF-8 character
sequences.

As Arnold says, there is no perfect solution, but if we can get index
sorting that is reasonably okay, and texi2pdf no longer crashes with
LuaTeX or XeTeX, then it would be a big improvement.


> 
> It's largely irrelevant for texindex, which has to be portable.
> 
> There's no perfect solution.
