On Tue, Mar 15, 2016 at 2:51 PM, Albert-Jan Roskam <sjeik_ap...@hotmail.com> wrote:
>
> So windows uses the following (Western locales):
> console: cp437 (OEM codepage)
> "bytes": cp1252 (ANSI codepage)

The console defaults to the OEM codepage, but you can separately switch
the input and output to different codepages. This is an exception to
the rule, as otherwise the system codepage used in the [A]NSI API is
fixed when the system boots. Changing it requires modifying the system
locale and rebooting.

> unicode: utf-16-le (is 'mbcs' equivalent to utf-16-*?)

The native Unicode encoding of Windows is UTF-16LE. This is what gets
used in the kernel, device drivers, and filesystems. UTF-16 was created
to accommodate the early adopters of 16-bit Unicode, such as Windows.
When you call an ANSI API, such as CreateFileA, the bytes argument(s)
get decoded to UTF-16, and then it calls the corresponding
wide-character function, such as CreateFileW (or maybe a common
internal function, but that's an implementation detail). ANSI is a
legacy API, and it's moving towards deprecation and obsolescence. New
WinAPI functions are often created with only wide-character support.

MBCS (multibyte character set) refers to the encodings that can be used
in the system locale for the [A]NSI API. While technically UTF-8 and
UTF-16 are multibyte encodings, they're not allowed in the legacy ANSI
API. That said, because the console is just plain weird, it allows
setting its input and output codepages to UTF-8, even though the result
is often buggy.

> Sheesh, so much room for errors. Why not everything utf-8, like in linux?

NT was developed before UTF-8 was released, so you're asking the NT
team to invent a time machine. Plus there's nothing really that
horrible about UTF-16. On the plus side, it uses only 2 bytes per
character for all characters in the BMP, whereas UTF-8 uses 3 bytes per
character for 61440 out of 63488 characters in the BMP (not including
the surrogate-pair block, U+D800-U+DFFF). On the plus side for UTF-8,
it encodes ASCII (i.e. ordinals less than 128) in a single byte per
character.

> Is cmd.exe that impopular that Microsoft does not replace it with something
> better?

cmd.exe is a shell, like powershell.exe or bash.exe, and a console
client application just like python.exe. cmd.exe doesn't host the
console, nor does it have anything to do with the console subsystem
other than being a client of it. When you run python.exe from cmd.exe,
all cmd does is wait for python to exit.

When you run a program that's flagged as a console application, the
system either attaches an inherited console if one exists, or opens and
attaches a new console. This window is hosted by an instance of
conhost.exe. It also implements the application's command-line editing,
input history buffer (e.g. F7), and input aliases, separately for each
attached executable. This part of the console API is accessible from
the command line via doskey.exe.

cmd.exe is a Unicode application, so codepages aren't generally of much
concern to it, except it defaults to encoding to the console codepage
when its built-in commands such as "dir" and "set" are piped to another
program. You can force it to use UTF-16 in this case by running cmd /U.
This is just for cmd's internal commands. What external commands write
to a pipe is up to them. For example, Python 3 defaults to using the
ANSI codepage when stdio is a pipe. You can override this via the
environment variable PYTHONIOENCODING.

As to replacing the console, I doubt that will happen. Microsoft has
little incentive to invest in improving/replacing the console and
command-line applications. Windows administration has shifted to
PowerShell scripting and cmdlets.

> am I correct in saying that the use of codepages (with stupid differences
> such as latin-1 vs cp1252 as a bonus) are designed to hamper
> cross-platform compatibility (and force people to stick with windows)?

The difference between codepage 1252 and Latin-1 is historical. Windows
development circa 1990 was following a draft ANSI standard for
character encodings, which later became the ISO 8859-x encodings.
Windows codepages ended up deviating from the ISO standard. Plus the
'ANSI' API also supports MBCS codepages for East-Asian languages.

I can't speak to any nefarious business plans to hamper cross-platform
compatibility. But it seems to me that this is a matter of rushing to
get a product to market without having time to wait for a standards
committee, plus a good measure of arrogance that comes from having
market dominance.

> Strange. I would have thought it writes the first 10 bytes (5 characters)
> and that the remaining 10 bytes end up in oblivion.

Maybe a few more details will help to clarify the matter. In Windows 7,
WriteFile basically calls WriteConsoleA, which makes a local procedure
call (LPC) to SrvWriteConsole in conhost.exe. This call is flagged as
to whether the buffer is ANSI or Unicode (UTF-16). If it's ANSI, the
console first decodes the buffer using MultiByteToWideChar according to
the output screen's codepage. Then it copies the decoded buffer to the
output screen buffer and returns to the caller how many UTF-16 *code
units* it wrote. Maybe that's fine for WriteConsoleA, but returning
that number to a WriteFile caller (a bytes API) is nonsense. If I write
a 20-byte buffer, I need to know that all 20 bytes were written, not
that 10 UTF-16 code units were written. This causes problems with
buffered writers such as Python 3's BufferedWriter class.
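
To put numbers on that mismatch, here is a rough sketch in plain Python. It doesn't call the console at all; it only reproduces the arithmetic, assuming the console output codepage is set to UTF-8 (65001) and the text is ten Greek letters. The helper utf16_code_units and the variable names are made up for illustration.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

# Ten Greek letters, all in the BMP: 2 bytes each in UTF-8,
# one code unit each in UTF-16.
text = "αβγδεζηθικ"
payload = text.encode("utf-8")   # what a bytes writer would pass down

bytes_submitted = len(payload)
# Assumption: this models the count the console hands back, i.e. the
# number of UTF-16 code units after decoding the buffer.
units_reported = utf16_code_units(payload.decode("utf-8"))

# A writer that reads the return value as a byte count would consider
# everything past that offset as not yet written.
apparent_leftover = payload[units_reported:]

print(bytes_submitted, units_reported, len(apparent_leftover))
# prints: 20 10 10
```

So all 20 bytes reached the screen, but a caller that trusts the returned count sees what looks like a partial write with 10 bytes still pending, and a buffered writer would presumably try to send those again.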