On Fri, Jan 16, 2026 at 10:19:54PM +0200, Eli Zaretskii wrote:
> > Can I clarify that "shown as raw bytes" means that they look like
> > "\302\251", i.e. as backslash escape sequences?
> 
> Actually, even worse: some look like control characters, some (e.g.,
> \200) look like ASCII strings produced to represent non-printable
> characters, i.e., with actual ASCII backslash and 3 octal digits.
> That's because printed_representation uses the locale-aware functions
> from the C runtime, and the locale hasn't been changed to use UTF-8
> (and with the older Windows runtime MSVCRT it cannot be changed in
> principle, because MSVCRT didn't support UTF-8).
> 
> What I wanted to accomplish was simple: have Info interpret the text
> as UTF-8, and output it as UTF-8.  But because the C runtime functions
> like mbrlen and iswprint, which are called by mb_len and mb_isprint,
> don't recognize UTF-8, they return results which get in the way.

It seems like it would be simple to add code to pass through non-ASCII
bytes to the terminal:

diff --git a/info/display.c b/info/display.c
index 4df6a45063..34deae02ef 100644
--- a/info/display.c
+++ b/info/display.c
@@ -501,7 +501,7 @@ printed_representation (mbi_iterator_t *iter, int *delim, size_t pl_chars,
 
   text_buffer_reset (&printed_rep);
 
-  if (mb_isprint (mbi_cur (*iter)))
+  if (0 && mb_isprint (mbi_cur (*iter)))
     {
       /* cur.wc gives a wchar_t object.  See mbiter.h in the
          gnulib/lib directory. */
@@ -575,6 +575,35 @@ printed_representation (mbi_iterator_t *iter, int *delim, size_t pl_chars,
     }
   else
     {
+      if (1)
+        {
+          unsigned char c = *cur_ptr;
+          if ((c & 0x80) == 0x00)
+            {
+              /* ASCII */
+              *pchars = 1;
+              *pbytes = 1;
+              ITER_SETBYTES (*iter, 1);
+              return cur_ptr;
+            }
+          if ((c & 0xc0) == 0x80)
+            {
+              /* UTF-8 continuation byte. */
+              *pchars = 0;
+              *pbytes = 1;
+              ITER_SETBYTES (*iter, 1);
+              return cur_ptr;
+            }
+          if ((c & 0xc0) == 0xc0)
+            {
+              /* UTF-8 initial byte. */
+              *pchars = 1;
+              *pbytes = 1;
+              ITER_SETBYTES (*iter, 1);
+              return cur_ptr;
+            }
+        }
+
       /* Original byte was not recognized as anything.  Display its octal
          value.  This could happen in the C locale for bytes above 128,
          or for bytes 128-159 in an ISO-8859-1 locale.  Don't output the bytes


This counts the screen width of every Unicode code point as 1 column,
which is correct for most text; the exceptions are double-width
characters (e.g. CJK) and zero-width combining characters.  It should
make UTF-8 files display mostly properly in the MS-Windows terminal that
you are using.
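
To make the column counting concrete, here is a minimal standalone
sketch of the same byte classification, outside of Info.  The names
classify_utf8_byte and count_columns are hypothetical and do not exist
in info/display.c; the sketch also assumes the input is valid UTF-8 and
ignores control characters, which the real code handles separately.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper names, for illustration only. */
enum utf8_byte_kind { UTF8_ASCII, UTF8_CONTINUATION, UTF8_LEAD };

/* Classify one byte using the same bit tests as the patch. */
enum utf8_byte_kind
classify_utf8_byte (unsigned char c)
{
  if ((c & 0x80) == 0x00)
    return UTF8_ASCII;          /* 0xxxxxxx */
  if ((c & 0xc0) == 0x80)
    return UTF8_CONTINUATION;   /* 10xxxxxx */
  return UTF8_LEAD;             /* 11xxxxxx, i.e. (c & 0xc0) == 0xc0 */
}

/* Count screen columns for LEN bytes at S, giving 1 column to each
   ASCII or lead byte and 0 columns to each continuation byte.  This
   does not validate the sequence and treats every code point as
   exactly 1 column wide. */
size_t
count_columns (const char *s, size_t len)
{
  size_t columns = 0;
  size_t i;

  assert (s != NULL);
  for (i = 0; i < len; i++)
    if (classify_utf8_byte ((unsigned char) s[i]) != UTF8_CONTINUATION)
      columns++;
  return columns;
}
```

For example, count_columns ("\302\251", 2) gives 1: the byte \302 is a
lead byte and counts as 1 column, and \251 is a continuation byte and
counts as 0.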

We could add an Info variable to customize this behaviour.
