Re: multibyte characters in the Info reader

Bruno Haible via Bug reports for the GNU Texinfo documentation system Thu, 15 Jan 2026 13:34:54 -0800

Eli Zaretskii wrote:
> Regarding the encoding of the Info file, it is not a serious problem,
> because (a) most Info files use UTF-8 anyway, and (b) the Info reader
> already includes support for re-encoding other codesets to UTF-8
> (provided that the Info reader is built with libiconv).  So the only
> case where the encoding of the Info file is relevant is if the Info
> reader was built without libiconv.


I see. So the problem is reduced to displaying
  - (U) UTF-8 text in memory (most frequent case), or
  - (L) locale-encoded text in memory (only if no iconv API available).

> Your questions above are mostly about encoding of the Info file, but
> the actual problem I was talking about is with how text is displayed.
> The encoding of the Windows terminal is, of course, only relevant to
> how text is displayed.
> 
> So what I would like to have is a way of bypassing the Windows CRT
> functions, and using Gnulib's code to do the likes of mbrlen and
> mbrtowc, but only when the locale's codeset is UTF-8.  ...
> Is that possible, with the current Gnulib?

This is *not* possible if you only use the Gnulib mbrlen, mbrtowc, etc.
functions because, as explained earlier, the function mbrtowc and the
macro MB_CUR_MAX form one API, and Gnulib can't implement MB_CUR_MAX == 4
when the system libraries only support MB_CUR_MAX ≤ 2.

> Is there a way to use (B) when the locale's codeset is UTF-8 and use
> (A) otherwise?

It *is* possible with Gnulib if you use the different APIs:
  - u8_mbtouc etc. for the case (U) above,
  - mbrtowc (or better: mbrtoc32) etc. for the case (L) above.
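
To illustrate case (U), here is a minimal sketch of a decoder with
roughly the call shape of u8_mbtouc, written by hand so that it compiles
without libunistring. The names sketch_u8_decode and sketch_u8_count are
hypothetical, and as a simplification the sketch always consumes exactly
one byte on invalid input, which the real function need not do. Nothing
in it consults the locale or MB_CUR_MAX:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for libunistring's u8_mbtouc: decode one
   character from the N bytes at S, store it in *PUC, and return the
   number of bytes consumed.  Invalid or truncated input yields U+FFFD
   and consumes one byte (a simplification).  */
static size_t
sketch_u8_decode (uint32_t *puc, const unsigned char *s, size_t n)
{
  assert (n > 0);
  unsigned char c = s[0];
  size_t len;
  uint32_t uc, min;

  if (c < 0x80)
    { *puc = c; return 1; }
  else if (c >= 0xC2 && c <= 0xDF)
    { len = 2; uc = c & 0x1F; min = 0x80; }
  else if ((c & 0xF0) == 0xE0)
    { len = 3; uc = c & 0x0F; min = 0x800; }
  else if (c >= 0xF0 && c <= 0xF4)
    { len = 4; uc = c & 0x07; min = 0x10000; }
  else
    { *puc = 0xFFFD; return 1; }

  if (n < len)
    { *puc = 0xFFFD; return 1; }
  for (size_t i = 1; i < len; i++)
    {
      /* Every trailing byte must look like 10xxxxxx.  */
      if ((s[i] & 0xC0) != 0x80)
        { *puc = 0xFFFD; return 1; }
      uc = (uc << 6) | (s[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and values beyond U+10FFFF.  */
  if (uc < min || uc > 0x10FFFF || (uc >= 0xD800 && uc <= 0xDFFF))
    { *puc = 0xFFFD; return 1; }
  *puc = uc;
  return len;
}

/* Count the characters in the N bytes at BUF, treating each invalid
   byte as one character.  */
static size_t
sketch_u8_count (const char *buf, size_t n)
{
  size_t count = 0;
  while (n > 0)
    {
      uint32_t uc;
      size_t k = sketch_u8_decode (&uc, (const unsigned char *) buf, n);
      buf += k;
      n -= k;
      count++;
    }
  return count;
}
```

Case (L) would instead keep going through mbrtowc or mbrtoc32 with an
mbstate_t, so the two paths have different call shapes.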

At first sight, this would mean duplicating the logic of the functions
'find_diff', 'display_process_line', 'printed_representation', and
'display_update_node_text' for the two cases, which would be
unmaintainable.

To avoid such code duplication, I can think of two maintainable
approaches:

  * You could declare libiconv a mandatory dependency, e.g. like in
    gettext/DEPENDENCIES:

    GNU libiconv
    + Not needed on systems with glibc and on NetBSD.
      But highly recommended on all other systems.
      Needed for character set conversion of PO files from/to Unicode
      and for the iconv_ostream class of libtextstyle.
    + Homepage:
      https://www.gnu.org/software/libiconv/
    + Download:
      https://ftp.gnu.org/gnu/libiconv/
    + Pre-built package name:
      - On Debian and Debian-based systems: --,
      - On Red Hat distributions: --.
      - Other: https://repology.org/project/libiconv/versions
    + If it is installed in a nonstandard directory, pass the option
      --with-libiconv-prefix=DIR to 'configure'.
    + On mingw, a slim alternative is the 'win-iconv' package version 0.0.8
      from https://github.com/win-iconv/win-iconv .

    This means that case (L) would not occur any more, and you could use
    the libunistring API (u8_mbtouc etc.) throughout display.c.

  * Alternatively, you could create an extended copy of gnulib/lib/mbchar.h,
    defining an abstract "multibyte character" that is UTF-8 encoded in case
    (U) and locale encoded in case (L), i.e. depending on a global variable.
    And then, an equally extended copy of gnulib/lib/mbiter.h, defining the
    iterator over such multibyte characters.

    This way, the dispatch between the two APIs (U) and (L) is in the files
    mbchar.h and mbiter.h, with only minor code duplications in display.c.
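
A rough sketch of what the second approach could look like, loosely
modeled on the character and iterator types in mbchar.h and mbiter.h.
All names here (text_is_utf8, struct xmbchar, struct xmbiter, xmbi_init,
xmbi_next) are hypothetical, the UTF-8 branch only checks the lead byte
and the continuation bytes, and both branches treat an invalid sequence
as a single invalid byte. It is meant to show where the dispatch would
live, not to be a finished implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical global: true in case (U), false in case (L).  */
static bool text_is_utf8 = true;

/* One "multibyte character", whatever the encoding in memory is.  */
struct xmbchar
{
  const char *ptr;   /* start of the character */
  size_t bytes;      /* its length in bytes, always > 0 */
  bool valid;        /* false for an invalid or incomplete sequence */
};

/* Iterator over such characters in a memory region.  */
struct xmbiter
{
  const char *cur;
  const char *end;
  mbstate_t state;   /* used only in case (L) */
};

static void
xmbi_init (struct xmbiter *it, const char *s, size_t n)
{
  assert (s != NULL);
  it->cur = s;
  it->end = s + n;
  memset (&it->state, 0, sizeof it->state);
}

/* Case (U): length from the lead byte, then check the trailing bytes.  */
static void
next_utf8 (struct xmbiter *it, struct xmbchar *ch)
{
  unsigned char c = (unsigned char) *it->cur;
  size_t avail = (size_t) (it->end - it->cur);
  size_t len = c < 0x80 ? 1
               : (c & 0xE0) == 0xC0 ? 2
               : (c & 0xF0) == 0xE0 ? 3
               : (c & 0xF8) == 0xF0 ? 4 : 0;
  bool ok = len > 0 && len <= avail;
  for (size_t i = 1; ok && i < len; i++)
    ok = ((unsigned char) it->cur[i] & 0xC0) == 0x80;
  ch->valid = ok;
  ch->bytes = ok ? len : 1;
}

/* Case (L): ask the locale-dependent C library function.  */
static void
next_locale (struct xmbiter *it, struct xmbchar *ch)
{
  size_t r = mbrlen (it->cur, (size_t) (it->end - it->cur), &it->state);
  if (r == (size_t) -1 || r == (size_t) -2)
    {
      ch->valid = false;
      ch->bytes = 1;
      memset (&it->state, 0, sizeof it->state);
    }
  else
    {
      ch->valid = true;
      ch->bytes = r == 0 ? 1 : r;   /* a NUL byte is one character */
    }
}

/* Fetch the next character into *CH.  Return false at the end.
   This is the only place that knows about the two cases.  */
static bool
xmbi_next (struct xmbiter *it, struct xmbchar *ch)
{
  if (it->cur >= it->end)
    return false;
  ch->ptr = it->cur;
  if (text_is_utf8)
    next_utf8 (it, ch);
  else
    next_locale (it, ch);
  it->cur += ch->bytes;
  return true;
}
```

A caller in display.c would then loop with xmbi_next and look only at
ptr, bytes, and valid, without knowing which of the two cases applies.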

Bruno
