> From: eryk...@gmail.com
> Date: Sun, 13 Mar 2016 13:58:46 -0500
> Subject: Re: [Tutor] Changing the interpreter prompt symbol from ">>>" to ???
> To: tutor@python.org
> CC: sjeik_ap...@hotmail.com
>
> On Sun, Mar 13, 2016 at 3:14 AM, Albert-Jan Roskam
> <sjeik_ap...@hotmail.com> wrote:
> > I thought that utf-8 (cp65001) is by definition (or by design?) impossible
> > for console output in Windows? Aren't there "W" (wide) versions of functions
> > that do accept utf-8?
>
> The wide-character API works with the native Windows character
> encoding, UTF-16. Except the console is a bit 'special'. A surrogate
> pair (e.g. a non-BMP emoji) appears as 2 box characters, but you can
> copy it from the console to a rich-text application, and it renders
> normally.
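If I understand the surrogate-pair point correctly, the two box characters are the two UTF-16 code units of the character. A quick pure-Python sketch (no Windows API involved) seems to confirm it:

```python
emoji = "\U0001F600"  # a non-BMP character (grinning face)

# Python 3 counts it as one character...
print(len(emoji))  # -> 1

# ...but in UTF-16 (the native Windows encoding) it is a surrogate
# pair: two 16-bit code units, i.e. 4 bytes in utf-16-le.
utf16 = emoji.encode("utf-16-le")
print(len(utf16))  # -> 4

high = int.from_bytes(utf16[:2], "little")
low = int.from_bytes(utf16[2:], "little")
print(hex(high), hex(low))  # -> 0xd83d 0xde00
```

So the console is apparently rendering each 16-bit code unit separately instead of treating the pair as one glyph.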
That is very useful to know.

> The console also doesn't support variable-width fonts for
> mixing narrow and wide (East Asian) glyphs on the same screen. If that
> matters, there's a program called ConEmu that hides the console and
> proxies its screen and input buffers to drive an improved interface
> that has flexible font support, ANSI/VT100 terminal emulation, and
> tabs. If you pair that with win_unicode_console, it's almost as good
> as a Linux terminal, but you have to jump through too many hoops to
> make it all work.

So Windows uses the following (for Western locales):

console: cp437 (OEM codepage)
"bytes": cp1252 (ANSI codepage)
unicode: utf-16-le (is 'mbcs' equivalent to utf-16-*?)

Sheesh, so much room for errors. Why not use utf-8 for everything, like
on Linux? Is cmd.exe so unpopular that Microsoft won't replace it with
something better? I understand that this silly OEM codepage is a
historical anomaly, but am I correct in saying that the use of codepages
(with stupid differences such as latin-1 vs. cp1252 as a bonus) is
designed to hamper cross-platform compatibility (and force people to
stick with Windows)?

> Some people try to use UTF-8 (codepage 65001) in the ANSI API --
> ReadConsoleA/ReadFile and WriteConsoleA/WriteFile. But the console's
> UTF-8 support is dysfunctional. It's not designed to handle it.
>
> In Windows 7, WriteFile calls WriteConsoleA, which decodes the buffer
> to UTF-16 using the current codepage and returns the number of UTF-16
> 'characters' written instead of the number of bytes. This confuses
> buffered writers. Say it writes a 20-byte UTF-8 string with 2 bytes
> per character. WriteFile returns that it successfully wrote 10
> characters, so the buffered writer tries to write the last 10 bytes
> again. This leads to a trail of garbage text written after every
> write.

Strange. I would have thought it writes the first 10 bytes (5
characters) and that the remaining 10 bytes end up in oblivion.
> When a program reads from the console using ReadFile or ReadConsoleA,
> the console's input buffer has to be encoded to the target codepage.
> It assumes that an ANSI character is 1 byte, so if you try to read N
> bytes, it tries to encode N characters. This fails for non-ASCII
> UTF-8, which has 2 to 4 bytes per character. However, it won't
> decrease the number of characters to fit in the N-byte buffer. In the
> API the argument is named "nNumberOfCharsToRead", and they're sticking
> to that literally. The result is that 0 bytes are read, which is
> interpreted as EOF. So the REPL will quit, and input() will raise
> EOFError.

_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
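P.S. The read-side failure Eryk describes can be simulated the same way (again pure Python; `broken_read` is a made-up name, not the real ReadFile):

```python
def broken_read(console_input: str, nbytes: int) -> bytes:
    """Simulate ReadFile from a cp65001 console: the console takes
    'nNumberOfCharsToRead' literally and encodes that many
    *characters*, even though the caller passed a byte count.  If the
    UTF-8 result doesn't fit in the buffer, it reports 0 bytes read."""
    encoded = console_input[:nbytes].encode("utf-8")  # nbytes CHARACTERS
    if len(encoded) > nbytes:
        return b""  # doesn't fit -> "0 bytes read", i.e. EOF
    return encoded


# Pure ASCII round-trips fine, because 1 character == 1 byte...
print(broken_read("hello", 5))  # -> b'hello'

# ...but one non-ASCII character makes the encoded text too big, the
# caller sees EOF, and e.g. input() raises EOFError.
print(broken_read("h\xe9llo", 5))  # -> b''
```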