Eryk Sun <[email protected]> added the comment:
> Vice versa, using bytes objects cannot represent all file names
> on Windows (in the standard mbcs encoding), hence Windows
> applications should use string objects to access all files.
This is outdated advice that should be removed, or at least reworded to
emphasize that the 'mbcs' encoding is only used in legacy mode, with a link to
the documentation of sys._enablelegacywindowsfsencoding [1].
Starting in Python 3.6, the default filesystem encoding in Windows is UTF-8.
Internally, a UTF-8 byte string is translated to UTF-16 (2 or 4 bytes per
character), the native Unicode encoding of the Windows API.
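To illustrate the 2-or-4-byte point, here is a rough sketch that uses the
plain 'utf-8' and 'utf-16-le' codecs. The helper name utf16_size is made up
for this example, and this is only an approximation of the translation, not
the code path CPython actually uses:

```python
def utf16_size(name_utf8: bytes) -> int:
    """Decode a UTF-8 byte string and return its length in bytes as UTF-16."""
    # 'utf-16-le' is used so that no byte order mark is prepended.
    return len(name_utf8.decode('utf-8').encode('utf-16-le'))

# U+00E9 is in the Basic Multilingual Plane: one 16-bit code unit.
print(utf16_size('\u00e9'.encode('utf-8')))

# U+1F600 is outside the BMP: a surrogate pair, i.e. two 16-bit code units.
print(utf16_size('\U0001F600'.encode('utf-8')))
```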
A caveat is that Windows filesystems use 16-bit characters that are not
restricted to valid Unicode. In particular, ordinals U+D800-U+DFFF are not
reserved for use in surrogate pairs, so a file name may contain a lone
surrogate. This is "Wobbly" Unicode, and the
filesystem encoding thus needs to be "Wobbly Transformation Format, 8-bit"
(WTF-8). This is implemented in Python by setting the encode errors handler to
"surrogatepass", in contrast to using "surrogateescape" in POSIX. For example,
os.fsencode('\ud800') succeeds in Windows but fails in POSIX, while
os.fsdecode(b'\x80') fails in Windows but succeeds in POSIX. The latter case is
not a practical problem since filesystem functions will never return an invalid
WTF-8 byte string.
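The difference between the two error handlers can be reproduced on any
platform by naming them explicitly instead of going through os.fsencode and
os.fsdecode. This is only a sketch: try_encode and try_decode are helper
names made up for the example, and it assumes that the explicit 'utf-8' codec
with each handler behaves like the corresponding filesystem encoding:

```python
import sys

def try_encode(text, errors):
    """Return text encoded as UTF-8, or None if the handler rejects it."""
    try:
        return text.encode('utf-8', errors)
    except UnicodeEncodeError:
        return None

def try_decode(data, errors):
    """Return data decoded from UTF-8, or None if the handler rejects it."""
    try:
        return data.decode('utf-8', errors)
    except UnicodeDecodeError:
        return None

# What this interpreter actually uses for file names.
print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())

# A lone surrogate: accepted by "surrogatepass", rejected by "surrogateescape".
print(try_encode('\ud800', 'surrogatepass'))
print(try_encode('\ud800', 'surrogateescape'))

# A stray byte: rejected by "surrogatepass", accepted by "surrogateescape".
# ascii() is used because printing a lone surrogate directly may fail.
print(ascii(try_decode(b'\x80', 'surrogatepass')))
print(ascii(try_decode(b'\x80', 'surrogateescape')))
```

The "surrogateescape" handler only encodes U+DC80-U+DCFF, the range it uses to
smuggle undecodable bytes through str, which is why '\ud800' is rejected there.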
---
[1]
https://docs.python.org/3/library/sys.html#sys._enablelegacywindowsfsencoding
----------
components: +Unicode, Windows
nosy: +eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, vstinner,
zach.ware
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue43395>
_______________________________________