Re: [Wireshark-dev] UTF8 vs. locale in error messages (bug 5715)

Guy Harris Mon, 11 Jul 2011 16:58:37 -0700

On Jul 11, 2011, at 4:00 PM, Stephen Fisher wrote:

> The popular SecureCRT terminal emulator defaults to "default" (same as 
> local system) character encoding, at least on Windows systems.  This is 
> not compatible with UTF-8 in my experience.


Not surprising, given that "default"/"same as local system" probably means 
"local code page".  Win32 first appeared in NT 3.1 in 1993, and Unicode first 
appeared in 1991 (and Microsoft joined the group doing it in 1990, at least 
according to the Wikipedia article), so Win32 could support Unicode from Day 
One.  Microsoft could get away with saying "if you want Unicode you have to use the 
Unicode versions of the APIs, and strings are UCS-2 in those versions of the 
APIs", with the legacy "ASCII"/"ANSI" APIs using code pages.  UN*X didn't have 
that advantage, so UN*X systems support Unicode using UTF-8 rather than with 
Shiny New APIs.
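
To make the mismatch concrete, here is a small Python sketch (Python only for brevity; nothing here comes from the Wireshark source). It shows one character under the three encodings mentioned above, and what a terminal that assumes a code page would display if it were handed UTF-8 bytes. Windows-1252 is an assumption standing in for "the local code page"; the real code page depends on how the system is configured.

```python
# Illustration only: cp1252 is an assumed stand-in for "the local code page".
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

as_code_page = text.encode("cp1252")   # what the legacy "ANSI" APIs would see
as_utf16 = text.encode("utf-16-le")    # UCS-2/UTF-16, as in the Unicode APIs
as_utf8 = text.encode("utf-8")         # what a UTF-8 UN*X locale would use

# A terminal that assumes cp1252 but receives UTF-8 decodes each byte
# as a separate character, so one character turns into two wrong ones.
misread = as_utf8.decode("cp1252")

print(as_code_page, as_utf16, as_utf8, misread)
```

The same character is one byte in the code page and two different byte pairs in UTF-16 and UTF-8, which is why output written in one encoding is garbled by a terminal expecting another.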

So, on Windows, consoles, whether from Microsoft or third parties, probably 
tend to use the local code page when they aren't using UCS-2/UTF-16 characters.  
For what it's worth, the Wikipedia article on the Win32 console:

        http://en.wikipedia.org/wiki/Win32_console

claims that

        Under Windows NT and CE based versions of Windows, the screen buffer
        uses four bytes per character cell: two bytes for character code, two
        bytes for attributes. The character is then encoded in a 16-bit subset
        of Unicode (UCS-2).[2] For backward compatibility, the console APIs
        exist in two versions: Unicode and non-Unicode. The non-Unicode
        versions of APIs can use code page switching to extend the range of
        displayed characters (but only if TrueType fonts are used for the
        console window, thereby extending the range of codes available). Even
        UTF-8 is available as "code page 65001".

At least according to

        http://msdn.microsoft.com/en-us/library/ms683458(v=VS.85).aspx

the device-independent I/O functions ReadFile() and WriteFile() (for UN*X 
folks, think read() and write()) don't support Unicode:

        High-level I/O gives you a choice between the ReadFile and WriteFile
        functions and the ReadConsole and WriteConsole functions. They are
        identical, except for two important differences. The console functions
        support the use of either Unicode characters or the ANSI character
        set; the file I/O functions do not support Unicode. Also, the file I/O
        functions can be used to access files, pipes, and serial
        communications devices; the console functions can only be used with
        console handles. This distinction is important if an application
        relies on standard handles that may have been redirected.

and I suspect that the C library _read() and _write() functions, and the 
"standard I/O library" functions that are presumably built atop them, probably 
ultimately run atop ReadFile() and WriteFile(), so that they're 
device-independent.

On UN*X, you probably get similar behavior, *mutatis mutandis* (e.g., replacing 
"the system code page setting" with "the code set portion of the setting of LANG 
or LC_CTYPE" or whatever), so we can't guarantee, on Windows or UN*X, that what 
gets printed with printf() or fprintf() can always be output in UTF-8, so

        1) we'd have to translate it to the appropriate character encoding

and

        2) not all Unicode characters can necessarily be represented in that
           encoding.
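
Both points can be sketched in a few lines of Python. This is a minimal illustration, not Wireshark code: the helper name to_output_bytes is hypothetical, and replacing unrepresentable characters with "?" is just one possible policy. A C program would need an equivalent conversion step (for example with iconv()) before calling fprintf().

```python
import locale


def to_output_bytes(text, encoding=None):
    """Encode text for an output stream that may not be UTF-8.

    If no encoding is given, fall back to the locale's preferred
    encoding. Characters the target encoding cannot represent are
    replaced with '?' rather than raising an error.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return text.encode(encoding, errors="replace")


message = "caf\u00e9 costs 3\u20ac"  # contains e-acute and the euro sign

print(to_output_bytes(message, "utf-8"))    # every character survives
print(to_output_bytes(message, "cp1252"))   # both exist in Windows-1252
print(to_output_bytes(message, "latin-1"))  # no euro sign in ISO 8859-1
```

With ISO 8859-1 as the target, the euro sign comes out as "?", which is point 2 in practice: the translation succeeds, but the output no longer carries everything the original string did.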

In the best of all possible worlds, all UN*X systems would be configured to use 
UTF-8 encoding and all Windows systems would be configured to use code page 
65001, but....
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev@wireshark.org>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-requ...@wireshark.org?subject=unsubscribe
