Eryk Sun <eryk...@gmail.com> added the comment:

> Vice versa, using bytes objects cannot represent all file names
> on Windows (in the standard mbcs encoding), hence Windows
> applications should use string objects to access all files.

This is outdated advice that should be removed, or at least reworded to emphasize that the 'mbcs' encoding is only used in legacy mode, with a link to the documentation of sys._enablelegacywindowsfsencoding [1].

Starting in Python 3.6, the default filesystem encoding in Windows is UTF-8. Internally, what happens is that a UTF-8 byte string gets translated to UTF-16 (2 or 4 bytes per character), the native Unicode encoding of the Windows API.

A caveat is that Windows filesystems use 16-bit characters that are not restricted to valid Unicode. In particular, ordinals U+D800-U+DFFF are not reserved for use in surrogate pairs. This is "Wobbly" Unicode, and the filesystem encoding thus needs to be "Wobbly Transformation Format, 8-bit" (WTF-8). This is implemented in Python by setting the encode errors handler to "surrogatepass", in contrast to using "surrogateescape" in POSIX.

For example, os.fsencode('\ud800') succeeds in Windows but fails in POSIX, while os.fsdecode(b'\x80') fails in Windows but succeeds in POSIX. The latter case is not a practical problem since filesystem functions will never return an invalid WTF-8 byte string.

---
[1] https://docs.python.org/3/library/sys.html#sys._enablelegacywindowsfsencoding

----------
components: +Unicode, Windows
nosy: +eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, vstinner, zach.ware

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43395>
_______________________________________
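
To illustrate the "surrogatepass" versus "surrogateescape" contrast described in the comment above, here is a minimal sketch that can be run on any platform. It passes the error handlers to the UTF-8 codec explicitly instead of calling os.fsencode and os.fsdecode, whose behavior depends on the host. It assumes the POSIX filesystem encoding is UTF-8, which is common but not guaranteed, and try_codec is a hypothetical helper introduced only for this illustration.

```python
# Portable illustration: the error handlers are passed explicitly, so the
# result does not depend on the platform's filesystem encoding settings.

def try_codec(func, *args):
    """Return the codec result, or the exception class name if it fails.

    Hypothetical helper, not part of the standard library.
    """
    try:
        return func(*args)
    except UnicodeError as exc:
        return type(exc).__name__

lone_surrogate = '\ud800'   # valid in a Windows file name, not valid Unicode text
stray_byte = b'\x80'        # not valid UTF-8 (or WTF-8) on its own

# Windows-style handling: UTF-8 with "surrogatepass".
win_encoded = try_codec(lone_surrogate.encode, 'utf-8', 'surrogatepass')
win_decoded = try_codec(stray_byte.decode, 'utf-8', 'surrogatepass')

# POSIX-style handling: UTF-8 (assumed) with "surrogateescape".
posix_encoded = try_codec(lone_surrogate.encode, 'utf-8', 'surrogateescape')
posix_decoded = try_codec(stray_byte.decode, 'utf-8', 'surrogateescape')

# ascii() keeps the output printable even when a result holds a lone surrogate.
print('surrogatepass   encode:', ascii(win_encoded))
print('surrogatepass   decode:', ascii(win_decoded))
print('surrogateescape encode:', ascii(posix_encoded))
print('surrogateescape decode:', ascii(posix_decoded))
```

With "surrogatepass" the lone surrogate encodes to three bytes while the stray byte is rejected. With "surrogateescape" the outcome is reversed: the stray byte decodes to an escape code point in the U+DC80-U+DCFF range and the lone surrogate is rejected.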