https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=16649

--- Comment #10 from Guy Harris <ghar...@sonic.net> ---
I was testing some stuff with the simple program

#define UTF8_PLACE_OF_INTEREST_SIGN     "\xe2\x8c\x98"      /*  8984 / 0x2318 */
#define UTF8_RIGHTWARDS_ARROW           "\xe2\x86\x92"      /*  8594 / 0x2192 */

#include <locale.h>
#include <stdio.h>

int
main(void)
{
#ifdef _WIN32
        setlocale(LC_ALL, ".UTF-8");
#endif
        printf("C:\\ " UTF8_PLACE_OF_INTEREST_SIGN "C in macOS "
               UTF8_RIGHTWARDS_ARROW " ^C elsewhere\n");
        return 0;
}

(the C:\\ was put there to test the problem mentioned in

    https://dev.to/mattn/please-stop-hack-chcp-65001-27db

which I *suspect* this change won't affect, as we're *not* changing the console
code page, at least from what I could find in the Universal C Runtime source I
have from my W10 VM).

On my Mac, running macOS 10.15.5, compiling and running that test program
produces the output

    C:\ ⌘C in macOS → ^C elsewhere

On my Windows 7 VM, which is updated about as far as I can go, compiling that
test program with MSVC Version 19.16.27034 for x64, and running it, produces
the output

    C:\ ?C in macOS → ^C elsewhere

if the console's code page is CP437 (standard PC code page, which is the
default), but produces the output

    C:\ ⌘

if the console's code page is CP65001 (UTF-8).

On my Windows 10 VM, currently running "Version 1909 (OS Build 18363.900)",
compiling that test program with MSVC Version 19.26.28806 for x86, and running
it, produces the output

    C:\ ?C in macOS  ^C elsewhere

if the code page is CP437 (there's a question mark in a box - probably the
glyph for REPLACEMENT CHARACTER - in the console display, but it copies as a
blank), but produces the output

    C:\ ⌘C in macOS → ^C elsewhere

if the console's code page is CP65001.  (All tests of, and sets of, the
console's code page were done with the chcp command.)

So the only place on Windows where UTF-8 Works Right with that program, in
those tests, is on Windows 10, and then only if the code page is 65001.

"You are in a twisty little maze of code paths, all different."

From a quick look at the Universal CRT source, version 10.0.17763.0 (I think
that's the SDK version; it was one of the SDK versions installed on my W10 VM),
it appears that:

* the setlocale() call just sets, for LC_CTYPE, an internal C runtime variable
indicating the code page for the locale;

* if the locale ends with ".UTF-8", that'll be CP_UTF8;

* the standard output and error descriptors (1 and 2 for _read(), _write(),
etc., not stdout and stderr for fprintf(), etc.) have the FTEXT file flag set
and have "textmode" set to __crt_lowio_text_mode::ansi;

* _write() calls to the console (and, thus, fprintf() etc. calls to the
console) get special translation done on the characters;

* if the descriptor has FTEXT set and the current locale isn't the "C" locale,
the special translation is the "double translation";

* if the "textmode" is __crt_lowio_text_mode::ansi, the "double translation" is
from the locale's code page -> UTF-16LE -> the console's code page (as noted,
this is console-specific), with a WriteFile() being done to the output HANDLE;

* if the "textmode" is either __crt_lowio_text_mode::utf16le or
__crt_lowio_text_mode::utf8, the buffer appears to be treated as UTF-16LE in
*either* case, no translation is done, and WriteConsoleW() is used (as noted,
this is console-specific).

Oy.  This is what happens when you do things a bit differently from UN*X and
then try to adapt.

So it looks as if, at least with that version of the Universal CRT, attempting
to write UTF-8 strings to the console will fail if any characters in the string
can't be mapped to a character in the *console's* code page.  (It won't "fail"
in the sense of returning an error, but it'll "fail" in the sense that you
won't get the right character written to the console.)

So that might explain why the PLACE OF INTEREST SIGN character wasn't showing
up if the console's code page is 437 - it's not in CP437.  I'm not sure why
nothing *after* it showed up on W7 when the console's code page was 65001 -
different C runtime?  Different console behavior?  Console doesn't yet
understand PLACE OF INTEREST SIGN?

Now, for W10, the failure to display the RIGHTWARDS ARROW, which *is* in CP437,
is strange.  It might be a consequence of a bug in Windows Terminal fixed by
this pull request:

    https://github.com/microsoft/terminal/pull/1964

as the rightwards arrow is a "low ASCII" character, 0x1A in CP437:

    https://en.wikipedia.org/wiki/Code_page_437#Character_set

That fix "went out for conhost in insider build 19002", so it's probably not in
the 18363 build that I'm currently running.

The *good* news is that the problems mentioned in

    https://dev.to/mattn/please-stop-hack-chcp-65001-27db

probably don't apply to the fix for this bug, as we're not changing the
console's code page.

However:

1) Unicode characters not in the current console code page won't get displayed
properly, even if the font has them, so the current fix for this bug won't
necessarily give you a correct display;

2) At least with older versions of Windows Terminal (which isn't in
W7/W8/W8.1), you won't get those "low ASCII" characters displayed *even with
CP437*.

___________________________________________________________________________
Sent via:    Wireshark-bugs mailing list <wireshark-bugs@wireshark.org>
Archives:    https://www.wireshark.org/lists/wireshark-bugs