Eryk Sun <eryk...@gmail.com> added the comment:

I think this is a locale configuration problem, in which the locale encoding 
doesn't match the terminal encoding. If so, it can be closed as not a bug.

> export a="中文"

In POSIX, the shell reads "中文" from the terminal as bytes encoded in the 
terminal encoding, which could be UTF-8 or some legacy encoding. The value of 
`a` is set directly as this encoded text. There is no intermediate 
decode/encode stage in the shell. For a child process that decodes the value of 
the environment variable, as Python does, the locale's LC_CTYPE encoding should 
be the same or compatible with the terminal encoding.
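
As a minimal sketch of why the encodings have to agree (the choice of GBK as 
the "legacy" terminal encoding is only an assumption for illustration), the 
same text is stored in the environment as different bytes depending on the 
terminal encoding, and decoding those bytes with a different encoding does not 
give back the original text:

```python
text = "中文"

# Bytes the shell would store for `a` under two different terminal encodings.
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")  # GBK is an assumed example of a legacy encoding

print(utf8_bytes)  # b'\xe4\xb8\xad\xe6\x96\x87'
print(gbk_bytes)   # b'\xd6\xd0\xce\xc4'

# Decoding with the encoding that was actually used recovers the text.
matched = utf8_bytes.decode("utf-8")

# Decoding with a mismatched encoding yields different text (mojibake).
mismatched = utf8_bytes.decode("gbk", errors="replace")
print(matched == text, mismatched == text)
```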

> job_name = os.environ['a']
> print(job_name)

In POSIX, sys.stdout.errors, as used by print(), will be "surrogateescape" if 
the default LC_CTYPE locale is a legacy locale -- which in 3.6 is the case for 
the "C" locale, since it's usually limited to 7-bit ASCII. "surrogateescape" is 
also the error handler used to decode the bytes in os.environb (POSIX) as the 
text in os.environ. When decoding, "surrogateescape" handles each non-ASCII 
byte that can't be decoded by translating it to a code in the reserved 
surrogate range U+DC80 - U+DCFF. When encoding, it translates each such 
surrogate code back to the original byte value in the range 0x80 - 0xFF.
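
The mapping can be checked directly with the codec machinery. This is only a 
sketch: it assumes the terminal sent UTF-8 bytes and that the locale encoding 
is ASCII, and the variable names are made up for the example:

```python
# Assumed setup: terminal encoding UTF-8, locale encoding ASCII ("C" locale).
raw = "中文".encode("utf-8")

# Decoding: each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
decoded = raw.decode("ascii", errors="surrogateescape")
codes = [hex(ord(ch)) for ch in decoded]
print(codes)

# Encoding: each surrogate U+DCNN is translated back to the byte 0xNN.
roundtrip = decoded.encode("ascii", errors="surrogateescape")
print(roundtrip == raw)
```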

Given the above setup, byte sequences in os.environb that can't be decoded with 
the default LC_CTYPE locale encoding will be surrogate escaped in the decoded 
text. The surrogate-escaped values roundtrip back to bytes when printed, 
presumably in the terminal encoding.
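
This roundtrip can be simulated without touching the real sys.stdout. In the 
sketch below, `fake_stdout` is a hypothetical stand-in configured the way 
sys.stdout would be in the "C" locale (ASCII with "surrogateescape"), and the 
terminal is again assumed to use UTF-8:

```python
import io

# Value as Python would see it in os.environ under the assumed setup.
raw = "中文".encode("utf-8")
job_name = raw.decode("ascii", errors="surrogateescape")

# Stand-in for sys.stdout in the "C" locale; collects the bytes written.
buffer = io.BytesIO()
fake_stdout = io.TextIOWrapper(
    buffer, encoding="ascii", errors="surrogateescape", newline="\n"
)

print(job_name, file=fake_stdout)
fake_stdout.flush()

# The bytes written are the original bytes, which a UTF-8 terminal
# would render as the original text.
written = buffer.getvalue()
```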

> with open('name.txt', 'w', encoding='utf-8')as fw:
>    fw.write(job_name)

The default error handler for open() is "strict" instead of "surrogateescape", 
so the surrogate-escaped values in job_name can't be encoded as UTF-8 and the 
write fails with a UnicodeEncodeError.
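
A sketch of the failure and of one possible way to recover the text follows. 
It assumes the original bytes really are UTF-8 and that the locale encoding is 
ASCII. The better fix is still to configure the locale to match the terminal:

```python
# Assumed setup: UTF-8 bytes decoded with ASCII and "surrogateescape".
raw = "中文".encode("utf-8")
job_name = raw.decode("ascii", errors="surrogateescape")

# The "strict" handler rejects lone surrogates, as open(..., encoding='utf-8')
# does when writing.
try:
    job_name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Recover the original bytes, then decode them with the encoding that the
# terminal is assumed to have used.
repaired = job_name.encode("ascii", errors="surrogateescape").decode("utf-8")
print(strict_failed, repaired)
```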

> Your code runs for me on Windows

In Windows, Python uses the wide-character (16-bit wchar_t) environment of the 
process for os.environ, and, in 3.6+, it uses the console session's 
wide-character API for console files such as sys.std* when they aren't 
redirected to a pipe or disk file. Conventionally, wide-character strings 
should be valid UTF-16LE text. So getting "中文" from os.environ and printing it 
should 'just work'. The output will even be displayed correctly if the console 
session uses a font that supports "中文", or if it's a pseudoconsole (conpty) 
session that's attached to a terminal that supports automatic font fallback, 
such as Windows Terminal.

----------
components: +IO, Interpreter Core, Library (Lib), Unicode -C API
nosy: +eryksun, ezio.melotti, vstinner

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43576>
_______________________________________
_______________________________________________
Python-bugs-list mailing list