[issue33780] [subprocess] Better Unicode support for shell=True on Windows

2021-03-15 Thread Eryk Sun


Eryk Sun  added the comment:

The complexity of mixing standard I/O from the shell and external programs is a 
limitation of the Windows command line. Each program can choose to use the 
system (or process) ANSI or OEM code page, the console session's input or 
output code page, UTF-8, or UTF-16. There's no uniform way to enforce a single, 
consistent choice. So I'm closing this issue as a third-party limitation that 
cannot be addressed in general. The problem has to be handled on a case-by-case 
basis.

--
resolution:  -> third party
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33780] [subprocess] Better Unicode support for shell=True on Windows

2018-06-17 Thread Yoni Rozenshein


Yoni Rozenshein  added the comment:

After reading your messages and especially after reading 
https://bugs.python.org/issue27179#msg267091 I admit I have been convinced this 
is much more complicated than I thought, and maybe more of a Windows bug than a 
Python bug :)

--




[issue33780] [subprocess] Better Unicode support for shell=True on Windows

2018-06-06 Thread Eryk Sun


Eryk Sun  added the comment:

> By default, the output of cmd is encoded with the "active" 
> codepage. In Python 3.6, you can decode this using 
> encoding='oem'.

FYI, the actual encoding is not necessarily "oem".

The console codepage may have been changed from the initial value by a 
SetConsoleCP call in the current process or another process (e.g. chcp.com, 
mode.com). For example, a batch script can switch to codepage 65001 to allow 
CMD to read a UTF-8 encoded batch file; or read UTF-8 from an external command 
in a `for /f` loop; or write UTF-8 to a disk file or pipe. 

(Only switch to codepage 65001 temporarily. Using UTF-8 for legacy console I/O 
is buggy. CMD, PowerShell, and Python 3.6+ aren't affected since they use the 
wide-character API for console I/O. But a legacy console application that uses 
the codepage implicitly with ReadFile and WriteFile for byte-based I/O may get 
invalid results such as reading a non-ASCII character as NUL, or the entire 
read failing, or writing garbage to the console after output that contains 
non-ASCII characters.)

To accommodate applications that use the current console codepage for standard 
I/O, Python could add two encodings that correspond to the current value of 
GetConsoleCP and GetConsoleOutputCP (e.g. named "conin" and "conout"). 
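
As a rough illustration of what such "conin" and "conout" encodings would 
resolve to, the sketch below queries the two console codepages via ctypes and 
maps them to Python codec names. The function names are hypothetical, the 
mapping is a simplification (not every codepage has a matching Python codec), 
and the console query only runs on Windows:

```python
import sys


def codepage_to_encoding(codepage):
    """Map a Windows codepage number to a likely Python codec name."""
    if codepage == 0:
        # 0 is CP_ACP, i.e. the ANSI codepage ("mbcs" is Windows-only).
        return "mbcs"
    if codepage == 65001:
        return "utf-8"
    return "cp{}".format(codepage)


def console_encodings():
    """Return (input, output) codec names for the attached console.

    Hypothetical helper. Returns None on non-Windows platforms.
    """
    if sys.platform != "win32":
        return None
    import ctypes
    kernel32 = ctypes.WinDLL("kernel32")
    return (codepage_to_encoding(kernel32.GetConsoleCP()),
            codepage_to_encoding(kernel32.GetConsoleOutputCP()))
```

A caller could pass the second value as the `encoding` for a child process 
that is known to write its output in the console output codepage.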

Additionally, we can't assume the console codepage is initially OEM. It depends 
on settings in the registry or the shell shortcut for the application that 
allocated the console. In particular, if a new console window is allocated by a 
process (either explicitly via AllocConsole or implicitly for a console app 
that either hasn't inherited a console or was created with the 
CREATE_NEW_CONSOLE or CREATE_NO_WINDOW creation flag), then the console loads 
custom settings from either the registry key "HKCU\Console\" or 
the shell shortcut (LNK file) that started the application. 

If the console uses the window-title registry key, it looks for a "CodePage" 
DWORD value. The key name is the normalized window title, which comes from the 
WindowTitle field of the process parameters. This can be set explicitly using 
the STARTUPINFO lpTitle field that's passed to CreateProcess. Otherwise the 
system uses the executable path as the default window title. The console 
normalizes the title string to create a valid key name by replacing backslash 
with underscore, and it also substitutes "%SystemRoot%" for the Windows 
directory, e.g. the default configuration key for CMD is 
"HKCU\Console\%SystemRoot%_system32_cmd.exe". 

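The title normalization described above can be sketched as follows. This only 
mirrors the two rules stated here (backslash to underscore, Windows directory 
to "%SystemRoot%"); the function name is hypothetical, the case-insensitive 
prefix match is an assumption, and the real console may apply further rules:

```python
def console_settings_key(window_title, windows_dir):
    """Approximate the per-title console settings key under HKCU\\Console."""
    prefix = windows_dir.rstrip("\\")
    rest = window_title[len(prefix):]
    title = window_title
    # Assumption: the Windows directory is matched case-insensitively and
    # only as a whole path component.
    if window_title.lower().startswith(prefix.lower()) and rest[:1] in ("", "\\"):
        title = "%SystemRoot%" + rest
    return "HKCU\\Console\\" + title.replace("\\", "_")


print(console_settings_key(r"C:\Windows\system32\cmd.exe", r"C:\Windows"))
# prints HKCU\Console\%SystemRoot%_system32_cmd.exe
```
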
The codepage can also be set in a shell shortcut (LNK file) [1]. When an 
application is started from a shell shortcut, the shell sets the STARTUPINFO 
flag STARTF_TITLEISLINKNAME and the lpTitle string to the fully-qualified path 
of the LNK file. In this case, the console reads the LNK file to load its 
settings, rather than using the window-title subkey in the registry. But the 
"HKCU\Console" root key is still used for the default settings.

Finally, if CMD is run without a console (i.e. using the DETACHED_PROCESS 
creation flag), the default codepage is ANSI, not OEM. This isn't hard-coded in 
CMD. It happens that GetConsoleCP returns 0 (i.e. CP_ACP) in this case.

[1]: https://msdn.microsoft.com/en-us/library/dd891330.aspx

--




[issue33780] [subprocess] Better Unicode support for shell=True on Windows

2018-06-06 Thread Eryk Sun


Eryk Sun  added the comment:

> To get the correct output, cmd has a "/u" switch (this switch has 
> probably existed forever - at least since Windows NT 4.0, by my 
> internet search). The output can then be decoded using 
> encoding='utf-16-le', like any native Windows string.

However, the /u switch doesn't affect how CMD reads from stdin when it's a disk 
file or pipe. For example, `set /p` will stop at the first NUL byte. In general 
this is mismatched with subprocess, which provides a single `encoding` value 
for all 3 standard I/O streams. For example:

    >>> import subprocess
    >>> r = subprocess.run('cmd /d /v /u /c "set /p spam= & echo !spam!"',
    ...     capture_output=True, input='spam', encoding='oem')
    >>> r.stdout
    's\x00p\x00a\x00m\x00\n\x00\n\x00'

With UTF-16 input, CMD only reads up to "s" instead of reading the entire 
"s\x00p\x00a\x00m\x00" string that was written to stdin:

    >>> r = subprocess.run('cmd /d /v /u /c "set /p spam= & echo !spam!"',
    ...     capture_output=True, input='spam', encoding='utf-16le')
    >>> r.stdout
    's\n'
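
The byte-level view makes the truncation easier to see, and it suggests one 
possible workaround: run the child in bytes mode and apply a different 
encoding to each stream. This is only a sketch. `run_cmd_u` is a hypothetical 
helper, it is Windows-only (the "oem" codec and cmd.exe are required), and 
the default input encoding is an assumption about what CMD expects on stdin:

```python
import subprocess

# What subprocess writes to the pipe for input='spam', encoding='utf-16le'.
written = "spam".encode("utf-16-le")

# `set /p` stops at the first NUL byte, so it only sees this much.
seen_by_set_p = written.split(b"\x00", 1)[0]


def run_cmd_u(command, input_text, input_encoding="oem"):
    """Run `command` under cmd /u with separate stdin and stdout encodings.

    Hypothetical, Windows-only helper: stdin is encoded with
    `input_encoding`, while stdout is decoded as UTF-16LE.
    """
    result = subprocess.run('cmd /d /v /u /c "{}"'.format(command),
                            input=input_text.encode(input_encoding),
                            capture_output=True)
    return result.stdout.decode("utf-16-le")
```

Output from external programs started by the shell would still need its own 
decoding, since /u only affects what CMD itself writes.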

> 1. A new argument to Popen

This may lead to confusion and false bug reports by people who expect the 
setting to also affect external programs run via the shell (e.g. tasklist.exe). 
It's also not consistent with how CMD reads from stdin, as shown above. 

I can see the use of adding a cross-platform get_shell_path() function that 
returns the fully-qualified path to the shell that's used by shell=True. This 
way programs don't have to figure it out on their own if they need custom shell 
options. 

Common CMD shell options in Windows include /d (skip AutoRun commands), /v 
(enable delayed expansion of environment variables via "!"), /e (enable command 
extensions), /k (remain running after the command), and /u. I'd prefer 
subprocess to use /d by default. It's strange that the CRT's system() command 
doesn't use it.

Currently the shell path can be "/bin/sh" or "/system/bin/sh" in POSIX and 
os.environ.get("COMSPEC", "cmd.exe") in Windows. I'd prefer that Windows 
instead used:

    shell_path = os.path.abspath(os.environ.get('ComSpec',
        os.path.join(_winapi.GetSystemDirectory(), 'cmd.exe')))

i.e. never use an unqualified, relative path such as "cmd.exe". 

Instead of the single-use GetSystemDirectory function, it could instead use 
_winapi.SHGetKnownFolderPath(_winapi.FOLDERID_System), or 
_winapi.SHGetKnownFolderPath('{1AC14E77-02E7-4E5D-B744-2EB1AE5198B7}') if the 
GUID constants aren't added.
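
For illustration, a get_shell_path() along these lines could be sketched with 
ctypes today. This is not an existing subprocess API. The function name is 
the hypothetical one suggested above, and calling GetSystemDirectoryW through 
ctypes stands in for the `_winapi` helpers, which are assumed not to exist:

```python
import os
import sys


def get_shell_path():
    """Return a fully-qualified path to the shell used for shell=True."""
    if sys.platform != "win32":
        # Same choice the POSIX branch described above makes.
        return "/system/bin/sh" if hasattr(sys, "getandroidapilevel") else "/bin/sh"
    import ctypes
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    buf = ctypes.create_unicode_buffer(260)  # MAX_PATH characters
    if kernel32.GetSystemDirectoryW(buf, len(buf)) == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    default = os.path.join(buf.value, "cmd.exe")
    # Never return an unqualified name such as "cmd.exe".
    return os.path.abspath(os.environ.get("ComSpec", default))
```

A program that needs custom shell options could then build its own command 
line from this path instead of relying on shell=True.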

--
nosy: +eryksun




[issue33780] [subprocess] Better Unicode support for shell=True on Windows

2018-06-06 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
components: +Windows
nosy: +giampaolo.rodola, paul.moore, steve.dower, tim.golden, vstinner, 
zach.ware




[issue33780] [subprocess] Better Unicode support for shell=True on Windows

2018-06-06 Thread Yoni Rozenshein


New submission from Yoni Rozenshein :

In subprocess, shell=True is implemented on Windows by launching a subprocess 
with the command line {comspec} /c "{args}" (normally comspec=cmd.exe).

By default, the output of cmd is encoded with the "active" codepage. In Python 
3.6, you can decode this using encoding='oem'.

However, this actually loses information. For example, try creating a file 
whose name uses characters that your active codepage cannot represent, and 
then run subprocess.check_output('dir', shell=True). In the output, those 
characters are replaced with question marks (by cmd, not by Python!).
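
The loss can be imitated in pure Python. This is only an analogy for what cmd 
does when it converts to the active codepage: the file name and the choice of 
cp437 are made up for the example.

```python
# Hypothetical file name: four Hebrew letters followed by ".txt".
name = "\u05e9\u05dc\u05d5\u05dd.txt"

# cp437 has no Hebrew letters, so each one is replaced with "?".
lossy = name.encode("cp437", errors="replace")
print(lossy)  # prints b'????.txt'

# UTF-16LE can represent every character, so the name round-trips.
wide = name.encode("utf-16-le")
print(wide.decode("utf-16-le") == name)  # prints True
```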

To get the correct output, cmd has a "/u" switch (this switch has probably 
existed forever - at least since Windows NT 4.0, by my internet search). The 
output can then be decoded using encoding='utf-16-le', like any native Windows 
string.

Currently, Popen constructs the command line in this hardcoded format: 
{comspec} /c "{args}", so you can't get the /u in there with the shell=True 
shortcut, and have to write your own wrapping code.
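
Such wrapping code might look roughly like the sketch below. Both function 
names are hypothetical, the command-line format simply mirrors the hardcoded 
one with /u added, and `run_shell_unicode` only works on Windows:

```python
import os
import subprocess


def build_unicode_shell_command(command, comspec=None):
    """Build the shell=True style command line with /u inserted before /c."""
    if comspec is None:
        comspec = os.environ.get("COMSPEC", "cmd.exe")
    return '{} /u /c "{}"'.format(comspec, command)


def run_shell_unicode(command):
    """Windows only: run `command` under cmd /u and decode its output."""
    raw = subprocess.check_output(build_unicode_shell_command(command))
    # The bytes are decoded directly, so line endings stay as "\r\n".
    return raw.decode("utf-16-le")
```

For example, run_shell_unicode('dir') would return the directory listing as a 
str, without question marks in place of characters outside the active 
codepage.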

I suggest adding a feature to Popen where /u may be inserted before the /c 
within the shell=True shortcut. I've thought of several ways to implement this:

1. A new argument to Popen, which indicates that we want Unicode shell output; 
if True, add the /u. Note that we already have a couple of Windows-only 
arguments to Popen, so this would not set a precedent.

2. If the encoding argument is 'utf-16-le' or one of its aliases, then add the 
/u.

3. If the encoding argument is not None, then add the /u.
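
To make option 2 concrete, here is a minimal sketch of the check it implies. 
Neither function exists in subprocess. The names are hypothetical, and codec 
aliases are resolved with codecs.lookup:

```python
import codecs


def wants_unicode_switch(encoding):
    """Return True if `encoding` is 'utf-16-le' or one of its aliases."""
    if encoding is None:
        return False
    try:
        return codecs.lookup(encoding).name == "utf-16-le"
    except LookupError:
        # Unknown codec names never trigger /u in this sketch.
        return False


def shell_command_line(comspec, args, encoding=None):
    """Build the shell=True command line, adding /u only for UTF-16LE."""
    switch = "/u /c" if wants_unicode_switch(encoding) else "/c"
    return '{} {} "{}"'.format(comspec, switch, args)


print(shell_command_line("cmd.exe", "dir", encoding="UTF-16LE"))
# prints cmd.exe /u /c "dir"
```

Option 3 would replace the codec comparison with a plain `encoding is not 
None` test.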

--
components: Library (Lib)
messages: 318807
nosy: Yoni Rozenshein
priority: normal
severity: normal
status: open
title: [subprocess] Better Unicode support for shell=True on Windows
type: enhancement
versions: Python 3.6, Python 3.7, Python 3.8
