[issue43395] os.path states that bytes can't represent all MBCS paths under Windows

Eryk Sun Fri, 05 Mar 2021 02:49:18 -0800

Eryk Sun <eryk...@gmail.com> added the comment:

>  instead of the stated 'surrogatepass'


In Python 3.6 and above on Windows, you can check this as follows:

    >>> import sys
    >>> sys.getfilesystemencoding()
    'utf-8'
    >>> sys.getfilesystemencodeerrors()
    'surrogatepass'

In Python 3.5 and earlier:

    >>> sys.getfilesystemencoding()
    'mbcs'

In 3.5, the error handler used by fsencode() and fsdecode() was hard-coded as
'strict' for the 'mbcs' encoding and as 'surrogateescape' otherwise.

> https://docs.python.org/3/library/os.html#os.fsencode
> https://docs.python.org/3/library/os.html#os.fsdecode

The above documentation needs to be updated to reference
sys.getfilesystemencodeerrors(), as do the docstrings:

    >>> import os, textwrap
    >>> print(textwrap.dedent(os.fsencode.__doc__))

    Encode filename to the filesystem encoding with 'surrogateescape' error
    handler, return bytes unchanged. On Windows, use 'strict' error handler if
    the file system encoding is 'mbcs' (which is the default encoding).

    >>> print(textwrap.dedent(os.fsdecode.__doc__))

    Decode filename from the filesystem encoding with 'surrogateescape' error
    handler, return str unchanged. On Windows, use 'strict' error handler if
    the file system encoding is 'mbcs' (which is the default encoding).
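
For comparison, here is a rough sketch of what the two functions do in 3.6
and later, written in terms of sys.getfilesystemencodeerrors(). The names
fsencode_sketch() and fsdecode_sketch() are hypothetical and this is not the
actual code in os.py; it only illustrates that the error handler is looked
up at runtime instead of being fixed to 'strict' or 'surrogateescape'.

```python
import os
import sys


def fsencode_sketch(filename):
    """Hypothetical stand-in for os.fsencode(); not the stdlib code."""
    filename = os.fspath(filename)  # accepts str, bytes, or os.PathLike
    if isinstance(filename, str):
        # Encoding and error handler both come from the running interpreter.
        return filename.encode(sys.getfilesystemencoding(),
                               sys.getfilesystemencodeerrors())
    return filename  # bytes are returned unchanged


def fsdecode_sketch(filename):
    """Hypothetical stand-in for os.fsdecode(); not the stdlib code."""
    filename = os.fspath(filename)
    if isinstance(filename, bytes):
        return filename.decode(sys.getfilesystemencoding(),
                               sys.getfilesystemencodeerrors())
    return filename  # str is returned unchanged
```

A docstring phrased this way would stay correct on both Windows and POSIX
without naming a specific handler.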

> https://docs.python.org/3/library/os.html#file-names-command-line-arguments-and-environment-variables

This should be rewritten to link to sys.getfilesystemencodeerrors(). I'm fine
with only discussing the use of "surrogateescape", which is a significant
concern on POSIX systems, where it is easy and common for filenames to be
created with an arbitrary encoding.
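
As a minimal illustration of why "surrogateescape" matters there, the sketch
below decodes a byte string that is not valid UTF-8 and then recovers the
original bytes. The filename is made up, and the codec is passed explicitly
so that the result does not depend on the platform's filesystem encoding.

```python
# A name written by a program that used Latin-1 (0xE9 is "é"); the bytes
# are not valid UTF-8.
raw = b'caf\xe9.txt'

# Each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
name = raw.decode('utf-8', 'surrogateescape')
print(ascii(name))  # 'caf\udce9.txt'

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode('utf-8', 'surrogateescape')

# The strict handler refuses to encode the lone surrogate.
try:
    name.encode('utf-8')
except UnicodeEncodeError:
    strict_rejects = True
else:
    strict_rejects = False
```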

I don't know if the use of "surrogatepass" on Windows warrants discussion. It
is uncommon to need the error handler because the filesystem is Unicode, and a
user is unlikely to create a filename with an unpaired surrogate code.

That said, before Windows 10, the legacy console allowed copying half of a 
surrogate pair to the clipboard, and a program could have a bug that nulls the 
second surrogate code in the pair (e.g. when limiting the length of a 
filename). Anyway, it's technically possible, so we support it. For example, 
"😈" (U+0001F608) is encoded in UTF-16 as the pair (U+D83D, U+DE08). A filename 
could end up with only the first of the two codes:

    >>> open('devil\ud83d', 'w').close()
    >>> print(ascii(os.listdir('.')[0]))
    'devil\ud83d'
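
The same example can be checked without touching the filesystem. The sketch
below derives the surrogate pair and encodes the truncated name with an
explicit 'utf-8' codec and the 'surrogatepass' handler, so it does not depend
on the platform's filesystem encoding; the variable names are only for
illustration.

```python
import struct

devil = '\U0001F608'

# Split the character into its two UTF-16 code units.
high, low = struct.unpack('<2H', devil.encode('utf-16-le'))
print(hex(high), hex(low))  # 0xd83d 0xde08

# A name that kept only the first half of the pair.
lone = 'devil' + chr(high)

# 'surrogatepass' encodes the lone surrogate as a 3-byte sequence.
encoded = lone.encode('utf-8', 'surrogatepass')
print(encoded)  # b'devil\xed\xa0\xbd'

# Decoding with the same handler gives the name back.
decoded = encoded.decode('utf-8', 'surrogatepass')

# The strict handler rejects the lone surrogate.
try:
    lone.encode('utf-8')
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False
```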

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43395>
_______________________________________