On Thu, Jan 15, 2026 at 10:33:58PM +0100, Bruno Haible via Bug reports for the GNU Texinfo documentation system wrote:
> Eli Zaretskii wrote:
> > Regarding the encoding of the Info file, it is not a serious problem,
> > because (a) most Info files use UTF-8 anyway, and (b) the Info reader
> > already includes support for re-encoding other codesets to UTF-8
> > (provided that the Info reader is build with libiconv).  So the only
> > case where the encoding of the Info file is relevant is if the Info
> > reader was built without libiconv.
>
> I see. So the problem is reduced to displaying
> - (U) UTF-8 text in memory (most frequent case), or
> - (L) locale-encoded text in memory (only if no iconv API available).
As I understand it, it is case (L) that is handled in info.  info attempts to recode Info files to the locale encoding (using the iconv function in libc).  It then uses locale-aware functions to process the contents of Info files; it does not make explicit use of UTF-8 in many places.  It wasn't clear to me from this discussion whether people understood that info already uses the iconv function.

The proposal to use "libiconv" appears to assume that the target encoding is always UTF-8.  That would require a slight change in how info loads Info files: it would recode them to UTF-8 unconditionally rather than to the locale encoding.  It would then require rewriting the whole program to use libunistring instead of libc functions.  (Texinfo already uses a lot of libunistring via gnulib in texi2any, although that part of the code is completely separate from info and uses a separate gnulib checkout.)

Eli: what is missing from my understanding of your use case is what is going on in scan.c:copy_converting when the Info file is first read in.  Does conversion of input files to UTF-8, based on the locale, actually happen?  Can I clarify that "shown as raw bytes" means that they look like "\302\251", i.e. as backslash escape sequences?

If the iteration over codepoints in printed_representation does not work, failing to recognise non-ASCII UTF-8 sequences even though the terminal supports them, then it would be better to fall back to ASCII substitutes when the file is first read in.  That would not be ideal, but it would be better than getting "\302\251" everywhere.  It would mean using the degrade_utf8 function in scan.c.  Another possibility is using the //TRANSLIT flag for an encoding passed to iconv.  (I didn't know about this possibility when I wrote the ASCII degradation code, as it wasn't documented in the libc manual or anywhere else I looked.)
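
For concreteness, here is a minimal standalone sketch of the //TRANSLIT approach (not code from info itself).  Note that //TRANSLIT is a glibc and GNU libiconv extension rather than POSIX, and the substitute it produces for a given character depends on the implementation and the locale:

    #include <iconv.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      /* glibc consults the locale's transliteration data, so pick up
         the user's locale first.  */
      setlocale (LC_ALL, "");

      /* "ASCII//TRANSLIT" asks iconv to approximate characters that
         do not exist in ASCII instead of failing with EILSEQ.  */
      iconv_t cd = iconv_open ("ASCII//TRANSLIT", "UTF-8");
      if (cd == (iconv_t) -1)
        {
          perror ("iconv_open");
          return 1;
        }

      char input[] = "Copyright \302\251 2026";  /* U+00A9 in UTF-8 */
      char output[64];
      char *inp = input, *outp = output;
      size_t inleft = strlen (input), outleft = sizeof output - 1;

      if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror ("iconv");
      *outp = '\0';

      /* Typically prints "Copyright (c) 2026" with glibc; the exact
         substitute (or a plain "?") varies by implementation.  */
      printf ("%s\n", output);

      iconv_close (cd);
      return 0;
    }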
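
For comparison, the existing recode-to-the-locale-encoding step described at the top of this message amounts to something like the following (again only a sketch of the general shape, not the actual code in scan.c:copy_converting).  The difference under the always-UTF-8 proposal would be passing "UTF-8" as the target encoding instead of nl_langinfo (CODESET):

    #include <iconv.h>
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      setlocale (LC_ALL, "");

      /* Recode from the Info file's encoding (assume UTF-8 here) to
         whatever the locale uses, e.g. ISO-8859-1 in a Latin-1
         locale.  */
      iconv_t cd = iconv_open (nl_langinfo (CODESET), "UTF-8");
      if (cd == (iconv_t) -1)
        {
          perror ("iconv_open");
          return 1;
        }

      char input[] = "na\303\257ve";  /* U+00EF in UTF-8 */
      char output[64];
      char *inp = input, *outp = output;
      size_t inleft = strlen (input), outleft = sizeof output - 1;

      /* Real code must handle EILSEQ (untranslatable sequence) and
         EINVAL (incomplete sequence at the end of the buffer).  */
      if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror ("iconv");
      *outp = '\0';

      printf ("%s\n", output);
      iconv_close (cd);
      return 0;
    }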
