[issue26345] Extra newline appended to UTF-8 strings on Windows

2016-02-12 Thread Eryk Sun

Eryk Sun added the comment:

This is a third-party problem due to bugs in the console's support for codepage
65001. For the general problem of Unicode in the console, see issue 1602. The 
best way to resolve this problem is by using the wide-character APIs, 
WriteConsoleW and ReadConsoleW. I suggest that you try the win_unicode_console 
package.
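
As a rough illustration of the wide-character route, here is a minimal ctypes
sketch that calls WriteConsoleW directly. The function names write_console_w
and utf16_units are made up for this example. It assumes stdout is attached to
a real console (the call fails when output is redirected to a file or pipe),
and it is not how win_unicode_console is implemented.

```python
import ctypes
import sys

STD_OUTPUT_HANDLE = -11  # Win32 constant, defined as (DWORD)-11


def utf16_units(text: str) -> int:
    """Count UTF-16 code units, the length unit that WriteConsoleW expects."""
    return len(text.encode('utf-16-le')) // 2


def write_console_w(text: str) -> int:
    """Hypothetical helper: write text to the console, bypassing codepages.

    Returns the number of UTF-16 code units the console reports as written.
    """
    if sys.platform != 'win32':
        raise OSError('WriteConsoleW is only available on Windows')
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetStdHandle.argtypes = (wintypes.DWORD,)
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = (
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID)
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = wintypes.DWORD()
    ok = kernel32.WriteConsoleW(handle, text, utf16_units(text),
                                ctypes.byref(written), None)
    if not ok:
        # Typically means the handle is not a console screen buffer.
        raise ctypes.WinError(ctypes.get_last_error())
    return written.value
```

On a Windows console, write_console_w('\u0391\r\n') should display the letter
whatever the active codepage is, because no byte encoding is involved.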

> But if I try to print something a little less common
> (GREEK CAPITAL LETTER ALPHA), something weird happens:
>
>>python -c "print(chr(0x391))"
>Α
>
>
>>

In versions of Windows that use the legacy console, WriteFile to a console 
screen mistakenly returns the number of UTF-16 codes written instead of the 
number of bytes written. 

For example, '\u0391\r\n' gets encoded as a four-byte buffer, b'\xce\x91\r\n'. 
Here's the result of writing this buffer to the legacy console, using codepage 
65001:

>>> sys.stdout.buffer.raw.write(b'\xce\x91\r\n')
Α
3

Four bytes were written, but the console reports that it wrote three UTF-16
codes. Python's BufferedWriter (i.e. sys.stdout.buffer) sees this as an 
incomplete write. So it writes the last byte again. That's why you see an extra 
newline. The problem can be far worse if the UTF-8 buffer contains many 
non-ASCII characters, especially if it includes codes greater than U+07FF that 
get encoded as three bytes. 
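
The retry behavior can be modeled in pure Python, without a console. This is
only a sketch: legacy_console_write and flush_like_buffered_writer are made-up
names that imitate the miscounted return value and the "write the rest again"
loop. They are not the console's or CPython's actual code.

```python
def legacy_console_write(data: bytes, screen: list) -> int:
    """Model of the buggy WriteFile: display the bytes, but report the
    number of UTF-16 code units instead of the number of bytes."""
    text = data.decode('utf-8', errors='replace')
    screen.append(text)
    return len(text.encode('utf-16-le')) // 2


def flush_like_buffered_writer(data: bytes, screen: list) -> None:
    """Keep writing the unwritten tail until the reported counts cover
    the whole buffer, as a buffered writer does after a short write."""
    pos = 0
    while pos < len(data):
        pos += legacy_console_write(data[pos:], screen)


screen = []
flush_like_buffered_writer('\u0391\r\n'.encode('utf-8'), screen)
print(screen)  # ['Α\r\n', '\n']

screen2 = []
flush_like_buffered_writer('Привет!\r\n'.encode('utf-8'), screen2)
print(screen2)  # ['Привет!\r\n', '\ufffdт!\r\n', '\n']
```

The first case shows the single extra newline. In the second case the retry
starts in the middle of a two-byte sequence, so the model prints a replacement
character followed by the tail of the string, and then one more newline.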

This particular problem is fixed in the new version of the console that comes 
with Windows 10. For the legacy console, you can work around the problem by 
hooking WriteConsoleA and WriteFile via DLL injection. For example, ANSICON and 
ConEmu do this.

That said, there's a far worse problem with using codepage 65001 in the 
console, which still exists in Windows 10. Due to this bug Python's interactive 
REPL will quit whenever you try to enter non-ASCII characters, and built-in 
input() will raise EOFError. For example:

>>> input()
Ü
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
EOFError

To read the console's wide-character (UTF-16) input buffer via ReadFile, it has 
to first get encoded to the current codepage. The console does the conversion 
via WideCharToMultiByte with a buffer size that assumes each UTF-16 value will 
be encoded as a single byte. But that's wrong for UTF-8, in which one UTF-16 
code can map to as many as three bytes. So WideCharToMultiByte fails, but does 
the console try to increase the buffer size? No. Does it fail the call? No. It
actually reports that it 'successfully' read 0 bytes. To the REPL and
built-in input() that signals EOF (end of file).
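
The sizing mistake can be sketched in a few lines of Python. This is a model
of the behavior described above, not the console's code, and
legacy_console_read is a made-up name.

```python
def legacy_console_read(typed: str) -> bytes:
    """Model of the buggy ReadFile path under codepage 65001."""
    utf16_units = len(typed.encode('utf-16-le')) // 2
    buffer_size = utf16_units  # wrongly assumes one byte per UTF-16 code
    encoded = typed.encode('utf-8')
    if len(encoded) > buffer_size:
        # The conversion fails, yet the read "succeeds" with 0 bytes.
        return b''
    return encoded


print(legacy_console_read('abc\r\n'))  # b'abc\r\n'
print(legacy_console_read('Ü\r\n'))    # b''
```

Pure ASCII input fits because each character needs exactly one byte. A single
non-ASCII character is enough to overflow the buffer, and the empty result is
what the caller interprets as EOF.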

If you only need to input text in your system locale, you can try to have the 
best of both worlds. Use chcp.com to set the command prompt to the codepage you 
need for input. Then in your Python script (e.g. in sitecustomize.py) you can 
use ctypes to change just the output codepage and rebind sys.stdout. For 
example:

>>> import os, sys, ctypes
>>> ctypes.WinDLL('kernel32').SetConsoleOutputCP(65001)
1
>>> sys.stdout = open(os.dup(sys.__stdout__.fileno()), 'w',
...                   encoding='cp65001')

>>> sys.stdin.encoding
'cp1252'
>>> input()
Ü
'Ü'
>>> print('\u0391')
Α

Another minor bug is that the console doesn't keep an overlapping window in 
case a UTF-8 sequence gets split across multiple writes (typically due to 
buffering). For example:

>>> exec(r'''
... sys.stdout.buffer.raw.write(b'\xce')
... sys.stdout.buffer.raw.write(b'\x91')
... ''')
��>>>

Since UTF-8 uses up to four bytes per code, the console would have to keep a 
three-byte buffer to handle the case of a split write.
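
For comparison, Python's own incremental decoder keeps exactly this kind of
carry-over state. The sketch below uses the standard codecs module to contrast
decoding each write on its own (roughly what the console does) with a decoder
that remembers the incomplete prefix:

```python
import codecs

# Stateless: each chunk is decoded in isolation, so both halves are invalid.
naive = (b'\xce'.decode('utf-8', errors='replace') +
         b'\x91'.decode('utf-8', errors='replace'))

# Stateful: the decoder holds on to the incomplete lead byte.
decoder = codecs.getincrementaldecoder('utf-8')()
first = decoder.decode(b'\xce')   # nothing can be emitted yet
second = decoder.decode(b'\x91')  # completes the sequence

print(repr(naive))   # '\ufffd\ufffd'
print(repr(first))   # ''
print(repr(second))  # 'Α'
```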

> Look, guys, I know what a mess Unicode handling on Windows is,
> and I'm not even sure it's Python's fault 

Unicode handling is only a mess in the Windows API if you think Unicode is 
synonymous with UTF-8. Windows NT is Unicode down to the lowest levels of the 
kernel, but it's UTF-16 using 16-bit wide characters. Part of the problem is 
that the C and POSIX APIs that are preferred by cross-platform applications are 
byte oriented (e.g. null-terminated char strings), so Unicode support becomes 
synonymous with UTF-8. On Windows this leaves you stuck using the ANSI 
codepage, which unfortunately cannot be set to codepage 65001. Microsoft would 
have to rewrite a lot of code to support UTF-8 in the ANSI API, and they have 
no incentive to pay for that given that they're heavily invested in UTF-16.

--
nosy: +eryksun
resolution:  -> third party
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26345] Extra newline appended to UTF-8 strings on Windows

2016-02-12 Thread STINNER Victor

STINNER Victor added the comment:

I guess that it's yet another example of the bug #1602: "windows console 
doesn't print or input Unicode".

Don't use the Windows console. Use a better console that has better Unicode
support instead. For example, you can play with IDLE :-) (Maybe PowerShell or
ConEmu?)
https://conemu.github.io/




[issue26345] Extra newline appended to UTF-8 strings on Windows

2016-02-11 Thread Egor Tensin

New submission from Egor Tensin:

I've come across an issue of Python 3.5.1 appending an extra newline when 
print()ing non-ASCII strings on Windows.

This only happens when the active "code page" is set to UTF-8 in cmd.exe:

>chcp
Active code page: 65001

Now, if I try to print an ASCII character (e.g. LATIN CAPITAL LETTER A), 
everything works fine:

>python -c "print(chr(0x41))"
A

>

But if I try to print something a little less common (GREEK CAPITAL LETTER 
ALPHA), something weird happens:

>python -c "print(chr(0x391))"
Α


>

For another example, let's try to print CYRILLIC CAPITAL LETTER A:

>python -c "print(chr(0x410))"
А


>

This only happens if the current code page is UTF-8 though.
If I change it to something that can represent those characters, everything 
seems to be working fine.
For example, the Greek letter:

>chcp 1253
Active code page: 1253

>python -c "print(chr(0x391))"
Α

>

And the Cyrillic letter:

>chcp 1251
Active code page: 1251

>python -c "print(chr(0x410))"
А

>

This also happens if one tries to print a string with a funny character 
somewhere in it. Sometimes it's even worse:

>python -c "print('Привет!')"
Привет!
�т!


>

Look, guys, I know what a mess Unicode handling on Windows is, and I'm not even 
sure it's Python's fault, I just wanted to make sure I'm not delusional and not 
making stuff up.
Can somebody at least confirm this? Thank you.

I'm using the x86-64 version of Python 3.5.1 on Windows 8.1.

--
components: Unicode, Windows
messages: 260153
nosy: Egor Tensin, ezio.melotti, haypo, paul.moore, steve.dower, tim.golden, 
zach.ware
priority: normal
severity: normal
status: open
title: Extra newline appended to UTF-8 strings on Windows
type: behavior
versions: Python 3.5
