[issue42707] Python uses ANSI CP for stdio on Windows console instead of using console or OEM CP

2020-12-22 Thread Eryk Sun


Eryk Sun  added the comment:

> When there is no console, stdio should use the default textio 
> encoding that is ANSI for now.

stdin, stdout, and stderr are special and can be special cased because they're 
used implicitly for IPC. They've always been acknowledged as special by the 
existence of PYTHONIOENCODING. I think if Python is going to change its policy 
for standard I/O, along the lines of what I think you've been arguing in favor 
of for months now, it should commit to (almost) consistently using the console 
input and output code pages for the standard I/O encoding in Windows, with 
UTF-8 as the default when there is no console session, and with the exception 
that UTF-8 is used for console files. To get legacy behavior, set 
PYTHONLEGACYWINDOWSSTDIO, which will use the console code pages for console 
standard I/O and otherwise use the process active code page for standard I/O.
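
The policy proposed above can be read as a small decision table. The following is only a sketch of that reading, not CPython code: the function name, its parameters, and the `_codec_name` helper are hypothetical, and the code page values are passed in rather than queried from Windows.

```python
from typing import Optional

CP_UTF8 = 65001


def _codec_name(code_page: int) -> str:
    # Hypothetical helper: map a Windows code page number to a Python codec name.
    return "utf-8" if code_page == CP_UTF8 else f"cp{code_page}"


def proposed_stdio_encoding(
    is_console_file: bool,
    console_cp: Optional[int],
    active_cp: int,
    legacy: bool = False,
) -> str:
    """Pick an encoding for one standard stream under the proposed policy.

    console_cp is the console input or output code page, or None when the
    process has no console session. active_cp stands for the GetACP() value.
    """
    if legacy:
        # PYTHONLEGACYWINDOWSSTDIO: console code page for console files,
        # process active code page for any other standard I/O.
        if is_console_file and console_cp is not None:
            return _codec_name(console_cp)
        return _codec_name(active_cp)
    if is_console_file or console_cp is None:
        # Console files always use UTF-8; so does stdio without a console.
        return "utf-8"
    # Pipes and redirected files follow the console code page.
    return _codec_name(console_cp)


piped_with_console = proposed_stdio_encoding(False, 850, 1252)
piped_without_console = proposed_stdio_encoding(False, None, 1252)
piped_legacy = proposed_stdio_encoding(False, 850, 1252, legacy=True)
```

Under this reading, a pipe in a cp850 console would get "cp850", a detached process would get "utf-8", and legacy mode would fall back to the active code page for the pipe.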

The default encoding for open() would still be the process active code page 
from GetACP(), and the recommendation should be for scripts to use an explicit 
`encoding`.
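
As a minimal illustration of that recommendation, a script can pass `encoding` explicitly so the result does not depend on the process active code page. The temporary file name is only an example.

```python
import os
import tempfile

# Write and read a file with an explicit encoding instead of relying on
# the default, which on Windows is the process active code page.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("é")

with open(path, "rb") as f:
    raw = f.read()  # the bytes on disk are UTF-8 regardless of the locale

with open(path, encoding="utf-8") as f:
    text = f.read()
```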

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com




2020-12-22 Thread Inada Naoki


Inada Naoki  added the comment:

> Okay, and also when GetConsoleCP() fails because there's no console (e.g. 
> python.exe w/ DETACHED_PROCESS creation flag, or pythonw.exe). 

When there is no console, stdio should use the default text I/O encoding, which 
is ANSI for now.

> However, using UTF-8 for the input code page is currently broken in many 
> cases, so it should not be promoted as a recommended solution until Microsoft 
> fixes their broken code (which should have been fixed 20 years ago; it's 
> ridiculous). Legacy console applications rely on ReadFile and ReadConsoleA. 
> Setting the input code page to UTF-8 is limited to reading 7-bit ASCII 
> (ordinals 0-127). Other characters get converted to null bytes.

Regardless of when we promote it, people use `chcp 65001` in cmd and 
`[Console]::OutputEncoding = [Text.Encoding]::UTF8` in PowerShell.
In such situations, UTF-8 is the best encoding for pipes and redirected files.

--





2020-12-22 Thread Eryk Sun

Eryk Sun  added the comment:

> How about treating only UTF-8 and leave legacy environment as-is?
> * When GetConsoleCP() returns CP_UTF8, use UTF-8 for stdin. 
> Otherwise, use ANSI.

Okay, and also when GetConsoleCP() fails because there's no console (e.g. 
python.exe w/ DETACHED_PROCESS creation flag, or pythonw.exe). 

However, using UTF-8 for the input code page is currently broken in many cases, 
so it should not be promoted as a recommended solution until Microsoft fixes 
their broken code (which should have been fixed 20 years ago; it's ridiculous). 
Legacy console applications rely on ReadFile and ReadConsoleA. Setting the 
input code page to UTF-8 is limited to reading 7-bit ASCII (ordinals 0-127). 
Other characters get converted to null bytes. For example:

>>> import ctypes, os
>>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
>>> kernel32.SetConsoleCP(65001)
1
>>> os.read(0, 10)
ab¡¢£¤cd
b'ab\x00\x00\x00\x00cd\r\n'

--





2020-12-22 Thread Inada Naoki


Inada Naoki  added the comment:

I think using the console code page for stdio is better, but I am afraid of 
breaking existing code.
How about handling only UTF-8 and leaving legacy environments as-is?

* When GetConsoleCP() returns CP_UTF8, use UTF-8 for stdin. Otherwise, use ANSI.
* When GetConsoleOutputCP() returns CP_UTF8, use UTF-8 for stdout. Otherwise, 
use ANSI.

This will work nicely with PowerShell, or with cmd after `chcp 65001`, in most 
simple cases.
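
The two rules above amount to one comparison per stream. A rough sketch, assuming the code page values have already been obtained from GetConsoleCP() and GetConsoleOutputCP(); the function name and the "cp1252" ANSI value are hypothetical.

```python
CP_UTF8 = 65001


def choose_stream_encoding(console_cp: int, ansi_encoding: str) -> str:
    # Use UTF-8 only when the console code page is UTF-8; keep the legacy
    # ANSI default in every other case.
    return "utf-8" if console_cp == CP_UTF8 else ansi_encoding


# stdin follows the input code page, stdout the output code page.
stdin_encoding = choose_stream_encoding(65001, "cp1252")
stdout_encoding = choose_stream_encoding(850, "cp1252")
```

Here stdin would become "utf-8" while stdout, still on code page 850, would stay on the ANSI encoding.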

--
nosy: +methane





2020-12-22 Thread Eryk Sun

Eryk Sun  added the comment:

> I understand Python should be reading the current CP (from 
> GetConsoleOutputCP) or using the default OEM CP, and not assuming ANSI CP 
> for stdio

A while ago I analyzed text encodings used by many of the legacy CLI programs 
in Windows. Some programs hard code using either the ANSI or OEM code page, and 
others use either the console's current input code page or its current output 
code page. In light of the inconsistencies, I think defaulting to ANSI for 
non-console standard I/O is fine.

> There's an IO codepage set on Windows consoles (`chcp` for cmd, 
> `[Console]::InputEncoding; [Console]::OutputEncoding` for PowerShell ;

The CMD shell is a Unicode (UTF-16) application, i.e. it calls wide-character 
system and console I/O functions such as ReadConsoleW() and WriteConsoleW(). It 
still uses the console output code page, but as a kind of locale encoding. For 
example, CMD uses the *output* code page when reading a batch file as well as 
when reading output from an external command in a `FOR /F` loop. If Python were 
only concerned with satisfying a `FOR /F` loop in CMD, then it would be 
reasonable to make stdout default to the console output code page. But 
"more.com" and "find.exe" are commonly used as well, and they decode piped 
input using the console *input* code page. Other commands such as "findstr.exe" 
use OEM.

PowerShell adds a spin to this problem. In CMD, piping bytes between two 
processes doesn't actively involve the shell. It just creates an anonymous 
pipe, with each process connected to either end. In contrast, PowerShell 
injects itself as a middle man. For example, piping between "python.exe" and 
"more.com" is implemented as a pipe from "python.exe" to PowerShell and a 
separate pipe from PowerShell to "more.com". In between, PowerShell decodes the 
output from "python.exe" using its current output encoding and then re-encodes 
it using its current input encoding before writing to "more.com".
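
The decode and re-encode step can be mimicked in a few lines to see where the mojibake in the report comes from. This is only a model of the behavior described above, not PowerShell's implementation; `powershell_relay` is a hypothetical name, and cp1252/cp850 are the code pages from the original report.

```python
def powershell_relay(data: bytes, output_encoding: str, input_encoding: str) -> bytes:
    # Model of the middle-man step: decode the producer's bytes with the
    # shell's output encoding, then re-encode with its input encoding.
    text = data.decode(output_encoding, errors="replace")
    return text.encode(input_encoding, errors="replace")


# Python writes 'é' to the pipe using the ANSI code page (cp1252 here).
python_output = "é".encode("cp1252")

# Shell encodings left at cp850: the byte passes through unchanged, and a
# consumer that decodes it as cp850 shows the wrong character.
relayed = powershell_relay(python_output, "cp850", "cp850")
shown = relayed.decode("cp850")

# Shell output encoding set to cp1252: the text is decoded correctly and
# transcoded to cp850 for the consumer.
fixed = powershell_relay(python_output, "cp1252", "cp850")
shown_fixed = fixed.decode("cp850")
```

In the first case `shown` is 'Ú', matching the output in the report; in the second the consumer receives a byte that cp850 decodes back to 'é'.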

> # If we adjust cmd CP, it's fine too:
> L:\Cop>chcp 1252
> Page de codes active : 1252
> L:\Cop>py testcp.py | more
> é

In this case, the ANSI code-page encoded output from Python is written to a 
pipe that's read directly by "more.com". In turn, "more.com" decodes the input 
bytes using the console input code page before writing UTF-16 text to the 
console via WriteConsoleW(). 

To make Python use the console input code page for standard I/O, query the code 
page via "chcp.com", and set PYTHONIOENCODING. For example:

C:\>chcp
Active code page: 437
C:\>set PYTHONIOENCODING=cp437
C:\>py -c "print('é')" | more
é
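
The same effect can be checked from Python by launching a child interpreter with PYTHONIOENCODING set and inspecting the raw bytes it writes to a pipe. A small sketch; `run_with_io_encoding` is a hypothetical helper, and cp437 is simply the code page from the example above.

```python
import os
import subprocess
import sys


def run_with_io_encoding(code: str, encoding: str) -> bytes:
    # Run a child Python with PYTHONIOENCODING set and return the raw
    # bytes it wrote to its (piped) stdout.
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


# The child prints 'é'; the escape keeps the command line pure ASCII.
raw = run_with_io_encoding(r"print('\u00e9')", "cp437")
decoded = raw.decode("cp437").strip()
```

With cp437 the character is written as the single byte 0x82, which is what a consumer decoding with code page 437 expects.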

It would be convenient to support encodings that are based on the current 
console code pages, maybe named "conin" and "conout", based on GetConsoleCP() 
and GetConsoleOutputCP(). For example:

C:\>set PYTHONIOENCODING=conin

They could default to the process active code page from GetACP() when there's 
no console. "ansi" and "oem" are already supported, so all four of the common 
encoding abstractions would be supported.
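
To make the idea concrete, here is a sketch of how such names might resolve. "conin" and "conout" do not exist in Python; they are only the names floated above. The function and its parameters are hypothetical, and the code page values are passed in (0 meaning the console query failed) rather than read from GetConsoleCP(), GetConsoleOutputCP(), GetACP(), and GetOEMCP().

```python
CP_UTF8 = 65001


def resolve_alias(
    name: str,
    console_cp: int,
    console_output_cp: int,
    active_cp: int,
    oem_cp: int,
) -> str:
    # A console code page of 0 means there is no console, in which case
    # the process active code page is used instead.
    table = {
        "conin": console_cp or active_cp,
        "conout": console_output_cp or active_cp,
        "ansi": active_cp,
        "oem": oem_cp,
    }
    code_page = table[name.lower()]
    return "utf-8" if code_page == CP_UTF8 else f"cp{code_page}"


with_console = resolve_alias("conin", 437, 437, 1252, 437)
without_console = resolve_alias("conout", 0, 0, 1252, 437)
```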

> when there's redirection or piping, encoding falls back to ANSI CP 
> (from config_get_locale_encoding).

The default encoding for files is locale.getpreferredencoding(), unless UTF-8 
mode is enabled. In Windows, this is the process active code page, as returned 
by WinAPI GetACP(). By default, this is the system ANSI code page.

Standard I/O isn't excepted from this, unless either PYTHONIOENCODING is set or 
it's a console device file. The default, non-legacy behavior for console files 
is to use UTF-8 at the buffer and raw I/O level. Internally, Python uses the 
wide-character console I/O functions ReadConsoleW() and WriteConsoleW(), with 
UTF-16 encoded text.
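
A quick way to see which case applies is to print the preferred encoding next to each standard stream's encoding and whether the stream is a terminal. A small sketch; `describe_stdio` is a hypothetical helper name.

```python
import locale
import sys


def describe_stdio():
    # Map each standard stream name to (encoding, is_terminal), or to None
    # when the stream is missing (for example under pythonw.exe).
    report = {}
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name)
        if stream is None:
            report[name] = None
            continue
        try:
            is_terminal = stream.isatty()
        except (AttributeError, ValueError):
            # Replaced or closed streams may not support isatty().
            is_terminal = False
        report[name] = (getattr(stream, "encoding", None), is_terminal)
    return report


preferred = locale.getpreferredencoding(False)
report = describe_stdio()
```

Writing `preferred` and `report` to a file, once from a console and once with the output piped, shows the difference described above.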

Windows 10 allows setting the system ANSI code page to UTF-8. It also allows an 
application to override its active code page to UTF-8, but that's not easy to 
change. It requires adding an "activeCodePage" setting to the manifest that's 
embedded in the executable, which can be done using the manifest tool, "mt.exe".

--
nosy: +eryksun





2020-12-21 Thread Alexey Izbyshev


Alexey Izbyshev  added the comment:

> I've been struggling today to understand why a simple file redirection 
> couldn't work properly (encoding issues)

The core issue is that "working properly" is not defined in general when we're 
talking about piping/redirection, as opposed to the console. Different programs 
that consume Python's output (or produce its input) can have different 
expectations wrt. data encoding, and there is no way for Python to know it in 
advance. In your examples, you use programs like "more" and "type" to print 
Python's output back to the console, so in this case using the OEM code page 
would produce the result that you expect. But, for example, in case Python's 
output was to be consumed by a C program that uses simple 
`fopen()/wscanf()/wprintf()` to work with text files, the ANSI code page would 
be appropriate because that's what the Microsoft C runtime library defaults to 
for wide character operations.

Python has traditionally used the ANSI code page as the default IO encoding for 
non-console cases (note that Python makes no distinction between non-console 
`sys.std*` and the builtin `open()` wrt. encoding), and this behavior can't be 
changed. You can use `PYTHONIOENCODING` or enable the UTF-8 mode[1] to change 
the default encoding.

Note that in your example you could simply use `PYTHONIOENCODING=cp850`, which 
would remove the need to use `chcp`.

[1] https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUTF8

--
nosy: +izbyshev, vstinner
versions: +Python 3.10, Python 3.8, Python 3.9





2020-12-21 Thread Alexandre

New submission from Alexandre :

Hello, 

First of all, I hope this was not already discussed (I searched the bugs, but it 
might have been discussed elsewhere) and that it really is a bug.

I've been struggling today to understand why a simple file redirection couldn't 
work properly (encoding issues), and I think I finally understand the 
whole thing.

There's an I/O code page set on Windows consoles (`chcp` for cmd, 
`[Console]::InputEncoding; [Console]::OutputEncoding` for PowerShell; chcp 
does not work in PowerShell, even though it reports that it set the CP), 850 
for my locale.
When there's no redirection or piping, PyWindowsConsoleIO takes care of the 
encoding (UTF-8, it seems), but when there's redirection or piping, the encoding 
falls back to the ANSI CP (from config_get_locale_encoding).

This behavior seems to be incorrect and breaks things. An example:
* testcp.py (file encoded as utf-8)
```
#!/usr/bin/env python3
# -*- coding: utf-8

print('é')
```

* using cmd:
```
# Test condition
L:\Cop>chcp
Page de codes active : 850

# We're fine here
L:\Cop>py testcp.py
é
L:\Cop>py -c "import sys; print(sys.stdout.encoding)"
utf-8

# Now with piping
L:\Cop>py -c "import sys; print(sys.stdout.encoding)" | more
cp1252

L:\Cop>py testcp.py | more
Ú
L:\Cop>py testcp.py > lol && type lol
Ú

# If we adjust cmd CP, it's fine too:
L:\Cop>chcp 1252
Page de codes active : 1252
L:\Cop>py testcp.py | more
é
```

* with pwsh:
```
PS L:\Cop> ([Console]::InputEncoding, [Console]::OutputEncoding) | select CodePage

CodePage

 850
 850

# Fine without redirection
PS L:\Cop> py .\testcp.py
é

# Here, write-host expect cp850
PS L:\Cop> py .\testcp.py | write-output
Ú
# Same with Out-file (used by ">")
PS L:\Cop> py .\testcp.py > lol; Get-Content lol
Ú

# 
PS L:\Cop> py .\testcp.py | more
Ú
```

By reading some sources today to solve my issue, I found many solutions:
* in PS `[Console]::OutputEncoding = [Text.Utf8Encoding]::new($false); 
$env:PYTHONIOENCODING="utf8"` or `[Console]::OutputEncoding = 
[Text.Encoding]::GetEncoding(1252)`
* in CMD `chcp 65001 && set PYTHONIOENCODING=utf8` (but this seems to break 
more) or `chcp 1252`

But reading (and trusting) 
https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os 
(https://docs.microsoft.com/en-us/windows/win32/intl/locale-idefault-constants), 
I understand Python should be reading the current CP (from 
GetConsoleOutputCP, like 
https://github.com/python/cpython/blob/3.9/Python/fileutils.c:) or using the 
default OEM CP, and not assuming the ANSI CP for stdio:
> * the OEM code page for use by legacy console applications,
> * the ANSI code page for use by legacy GUI applications.

The init path I could trace:
> https://github.com/python/cpython/blob/3.9/Python/pylifecycle.c
> init_sys_streams
>> create_stdio 
>> (https://github.com/python/cpython/blob/3.9/Python/pylifecycle.c#L1774)
>>> open.raw: 
>>> https://github.com/python/cpython/blob/3.9/Modules/_io/_iomodule.c#L374
>>> https://github.com/python/cpython/blob/3.9/Modules/_io/winconsoleio.c
>> fallback to init_sys_streams encoding
> https://github.com/python/cpython/blob/3.9/Python/initconfig.c
> config_init_stdio_encoding
> config_get_locale_encoding
> GetACP()

Some tests with GetConsoleCP:
```
L:\Cop>py -c "import os; print(os.device_encoding(0), os.device_encoding(1))" | more
cp850 None

L:\Cop>type nul | py -c "import os; print(os.device_encoding(0), os.device_encoding(1))"
None cp850

L:\Cop>type nul | py -c "import ctypes; print(ctypes.windll.kernel32.GetConsoleCP(), ctypes.windll.kernel32.GetConsoleOutputCP())"
850 850

L:\Cop>py -c "import ctypes; print(ctypes.windll.kernel32.GetConsoleCP(), ctypes.windll.kernel32.GetConsoleOutputCP())" | more
850 850
```
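
The `os.device_encoding()` checks above can also be wrapped in a small script. A sketch, with `device_encodings` as a hypothetical helper name; it relies only on `os.device_encoding()` returning None for a descriptor that is not connected to a terminal.

```python
import os


def device_encodings(fds=(0, 1, 2)):
    # Map each file descriptor to its terminal encoding, or None when the
    # descriptor is redirected, piped, or unusable.
    result = {}
    for fd in fds:
        try:
            result[fd] = os.device_encoding(fd)
        except OSError:
            result[fd] = None
    return result


encodings = device_encodings()
```

Run with stdout piped, the entry for descriptor 1 is None while descriptor 0 still reports the console encoding, as in the first test above.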

Some links / documentation, if useful:
* https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os
* https://docs.microsoft.com/en-us/windows/win32/intl/locale-idefault-constants
* https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getoemcp
* https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
* https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp
* https://stackoverflow.com/questions/56944301/why-does-powershell-redirection-change-the-formatting-of-the-text-content
* https://stackoverflow.com/questions/19122755/output-echo-a-variable-to-a-text-file
* https://stackoverflow.com/questions/40098771/changing-powershells-default-output-encoding-to-utf-8
* Maybe related: https://github.com/PowerShell/PowerShell/issues/10907
* https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window
 (will probably break things :) )
* https://stackoverflow.com/questions/49476326/displaying-unicode-in-powershell/49481797#49481797
* https://stackoverflow.com/questions/25642746/how-do-i-pipe-unicode-into-a-native-application-in-powershell