Re: multibyte characters in the Info reader

Eli Zaretskii Thu, 15 Jan 2026 09:53:56 -0800

> From: Bruno Haible <[email protected]>
> Cc: [email protected], [email protected]
> Date: Thu, 15 Jan 2026 15:52:26 +0100
> 
> You are right regarding the limited support of UTF-8 locales on native
> Windows. And the problem is not limited to Windows, it's a basic choice
> of programming APIs.
> 
> There are three ways to read files that contain multibyte characters:
> 
>   (A) Use locale-aware functions.
>   (B) Use a specific encoding always (e.g. UTF-8), independently of the
>       locale.
>   (C) Support many encodings, independently of the locale.
> 
> Approach (A) consists of the function mbrtowc(), the macro MB_CUR_MAX, and
> higher layers based on mbrtowc(): mbi_iterator_t etc.
> 
> Since locales are defined by the system (and many systems don't have a POSIX
> compliant 'localedef' utility), this limits the available encodings:
>   - On native Windows with MSVCRT, UTF-8 locales are not supported.
>     Not without Gnulib and not with Gnulib, because we can't support
>     MB_CUR_MAX == 4 if the system supports only MB_CUR_MAX ≤ 2.
>   - On macOS, musl libc, Android, and other platforms, unibyte locales are not
>     supported because all locales use UTF-8 or ASCII.
>   - On AIX, the system supports UTF-8 locales; but if your sysadmin has only
>     installed ISO-8859-1 locales, you are doomed as well.
> 
> Approach (B) consists of using e.g. libunistring with the various u8_*
> functions. This is independent of the locale, but it's only one
> ASCII-compatible encoding.
> 
> Approach (C) generalizes (B) by supporting several encodings. Up to 10 types
> of encodings can be supported (unibyte, UTF-8, EUC, EUC-JP, EUC-TW, BIG5,
> BIG5-HKSCS, GBK, GB18030, Shift_JIS).
> 
> The code in texinfo/info/display.c uses approach (A), with the limitations
> mentioned above.
> 
> If you want to overcome these limitations, the following questions need
> to be answered first:
>   - Which text encodings can occur in Info files?
>   - Who decides about the text encoding in an Info file?
>   - Are there commands for converting an Info file from one encoding to
>     another (kind of info-iconv)?


What I would like to do is to have full support when the terminal can
support UTF-8 encoding, and delegate the other output encodings to the
system libraries.

Regarding the encoding of the Info file, it is not a serious problem,
because (a) most Info files use UTF-8 anyway, and (b) the Info reader
already includes support for re-encoding other codesets to UTF-8
(provided that the Info reader is built with libiconv).  So the only
case where the encoding of the Info file is relevant is if the Info
reader was built without libiconv.
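
To illustrate the kind of re-encoding meant here, below is a minimal
sketch that uses the standard iconv API.  It is not the code from the
Info reader; the function name to_utf8 and the "4 output bytes per
input byte" buffer estimate are assumptions made for this example only.

```c
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical helper, not the Info reader's own code: convert LEN
   bytes of TEXT from FROM_CODESET to UTF-8.  Returns a malloc'ed,
   NUL-terminated string, or NULL on any failure.  */
char *
to_utf8 (const char *from_codeset, const char *text, size_t len)
{
  iconv_t cd;
  char *out, *outp, *inp;
  size_t in_left, out_left, r;

  cd = iconv_open ("UTF-8", from_codeset);
  if (cd == (iconv_t) -1)
    return NULL;                /* conversion not supported */

  /* Assumption: 4 output bytes per input byte is enough.  If it is
     not, iconv fails with E2BIG and we simply give up.  */
  out_left = len * 4;
  out = malloc (out_left + 1);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  inp = (char *) text;          /* iconv takes char **, not const char ** */
  in_left = len;
  outp = out;

  r = iconv (cd, &inp, &in_left, &outp, &out_left);
  if (r != (size_t) -1)
    /* Flush any pending shift state of stateful encodings.  */
    r = iconv (cd, NULL, NULL, &outp, &out_left);
  iconv_close (cd);

  if (r == (size_t) -1)
    {
      free (out);
      return NULL;              /* invalid input or buffer too small */
    }
  *outp = '\0';
  return out;
}
```

For example, to_utf8 ("ISO-8859-1", "caf\xE9", 4) should yield the
five bytes "caf\xC3\xA9" on a system whose iconv knows ISO-8859-1.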

Your questions above are mostly about encoding of the Info file, but
the actual problem I was talking about is with how text is displayed.
The encoding of the Windows terminal is, of course, only relevant to
how text is displayed.

So what I would like to have is a way of bypassing the Windows CRT
functions, and using Gnulib's code to do the likes of mbrlen and
mbrtowc, but only when the locale's codeset is UTF-8.  (The Windows
port of the Info reader already overrides the locale's codeset with
the codepage returned by GetConsoleOutputCP, so if the user sets that
to codepage 65001, the Info reader will consider the locale's codeset
to be UTF-8.)  Is that possible, with the current Gnulib?
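
For concreteness, the codeset test described in the parenthesis above
could look roughly like the sketch below.  The function names are
hypothetical (they are not taken from the Info sources); only
GetConsoleOutputCP, CP_UTF8 and nl_langinfo are real APIs.  On POSIX
systems the caller is assumed to have called setlocale (LC_ALL, "")
beforehand.

```c
#include <ctype.h>
#include <stddef.h>
#ifdef _WIN32
# include <windows.h>
#else
# include <langinfo.h>
#endif

/* Hypothetical helper: nonzero if NAME spells UTF-8, ignoring case
   and hyphens, so that both "UTF-8" and "utf8" are accepted.  */
int
codeset_name_is_utf8 (const char *name)
{
  static const char want[] = "utf8";
  size_t i = 0;

  if (name == NULL)
    return 0;
  for (; *name != '\0'; name++)
    {
      if (*name == '-')
        continue;
      if (want[i] == '\0' || tolower ((unsigned char) *name) != want[i])
        return 0;
      i++;
    }
  return want[i] == '\0';
}

/* Hypothetical helper: nonzero if output should be treated as UTF-8.  */
int
output_codeset_is_utf8 (void)
{
#ifdef _WIN32
  /* Codepage 65001 is UTF-8.  */
  return GetConsoleOutputCP () == CP_UTF8;
#else
  return codeset_name_is_utf8 (nl_langinfo (CODESET));
#endif
}
```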

> If you want to use approach (B), it's a different API, documented in the
> GNU libunistring manual [1] and available through Gnulib modules [2].

Is there a way to use (B) when the locale's codeset is UTF-8 and use
(A) otherwise?  Because the Info reader already converts all external
encodings to UTF-8 internally (provided libiconv is available), see
scan.c:copy_converting in the Info sources.
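
To make the question concrete, here is a rough sketch of the kind of
dispatch I have in mind.  It uses a small hand-written UTF-8 scanner
in place of libunistring, so that it is self-contained; the names
utf8_seq_len and char_len, and the codeset_is_utf8 flag, are made up
for this example and do not exist in Info or Gnulib.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: length in bytes of the well-formed UTF-8
   sequence at S (N bytes available), or 0 if it is invalid or
   truncated.  Does not consult the locale or the C runtime.  */
static size_t
utf8_seq_len (const unsigned char *s, size_t n)
{
  size_t len, i;
  unsigned char c;

  if (n == 0)
    return 0;
  c = s[0];
  if (c < 0x80)
    return 1;                   /* ASCII */
  if (c >= 0xC2 && c <= 0xDF)
    len = 2;
  else if (c >= 0xE0 && c <= 0xEF)
    len = 3;
  else if (c >= 0xF0 && c <= 0xF4)
    len = 4;
  else
    return 0;                   /* stray continuation or invalid lead byte */
  if (n < len)
    return 0;                   /* truncated sequence */
  for (i = 1; i < len; i++)
    if ((s[i] & 0xC0) != 0x80)
      return 0;
  /* Reject overlong forms, surrogates, and values above U+10FFFF.  */
  if ((c == 0xE0 && s[1] < 0xA0) || (c == 0xED && s[1] > 0x9F)
      || (c == 0xF0 && s[1] < 0x90) || (c == 0xF4 && s[1] > 0x8F))
    return 0;
  return len;
}

/* Hypothetical wrapper: use the UTF-8 path (a stand-in for approach
   (B)) when CODESET_IS_UTF8 is nonzero, and the locale-aware mbrlen
   (approach (A)) otherwise.  Returns 0 for an invalid or incomplete
   character, and 1 for a NUL byte.  */
size_t
char_len (const char *s, size_t n, int codeset_is_utf8, mbstate_t *ps)
{
  size_t r;

  if (codeset_is_utf8)
    return utf8_seq_len ((const unsigned char *) s, n);
  r = mbrlen (s, n, ps);
  if (r == (size_t) -1 || r == (size_t) -2)
    return 0;
  return r == 0 ? 1 : r;
}
```

In a real implementation the UTF-8 branch would presumably call the
u8_* functions instead, and the caller would have to reset the
mbstate_t after an invalid sequence on the mbrlen branch.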

Thanks.
