[issue35883] Change invalid unicode characters to replacement characters in argv

STINNER Victor Sat, 13 Mar 2021 05:11:04 -0800


STINNER Victor <vstin...@python.org> added the comment:


I wrote PR 24843 to fix this issue. With this fix, os.fsencode(sys.argv[1]) 
returns the original byte sequence as expected.

--

I dislike the replace error handler since it loses information. The PEP 383 
surrogateescape error handler exists to prevent losing information.
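
To illustrate the difference between the two error handlers, here is a minimal sketch. The byte string is a made-up example of an argument that is not valid UTF-8, and UTF-8 is assumed as the locale encoding:

```python
# Hypothetical command line argument bytes: 0xE9 alone is not valid UTF-8.
raw = b"caf\xe9"

# "replace" substitutes U+FFFD for the undecodable byte.
replaced = raw.decode("utf-8", errors="replace")
# Encoding back yields the UTF-8 form of U+FFFD, not the original byte.
lossy = replaced.encode("utf-8")

# "surrogateescape" (PEP 383) maps the undecodable byte 0xE9 to the lone
# surrogate U+DCE9, which encodes back to the original byte.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

print(lossy == raw)      # the original byte is lost
print(restored == raw)   # the original bytes are recovered
```

os.fsencode() relies on the same round trip, using the filesystem encoding and error handler.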

The root issue is that Py_DecodeLocale() creates wide characters outside the 
valid Python Unicode range: [U+0000; U+10ffff].

On Linux, Py_DecodeLocale() usually calls the C library's mbstowcs(). The 
problem is that the glibc UTF-8 decoder doesn't respect RFC 3629: it doesn't 
reject characters outside the [U+0000; U+10ffff] range. The following issue 
requests changing the glibc UTF-8 codec to respect RFC 3629, but it has been 
open since 2006:
https://sourceware.org/bugzilla/show_bug.cgi?id=2373
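
For comparison, Python's own UTF-8 decoder enforces the RFC 3629 limit. A small sketch, where the byte sequences are hand-built examples and not taken from the report:

```python
# F4 8F BF BF encodes U+10FFFF, the last code point allowed by RFC 3629.
last_valid = b"\xf4\x8f\xbf\xbf".decode("utf-8")

# F4 90 80 80 would encode U+110000, which is outside the valid range.
beyond_range = b"\xf4\x90\x80\x80"
try:
    beyond_range.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    # Python's strict UTF-8 decoder refuses the sequence.
    rejected = True

print(hex(ord(last_valid)), rejected)
```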

Even if glibc changes, Python should behave the same on old glibc versions.

My PR modifies Py_DecodeLocale() to check whether there are characters outside 
the [U+0000; U+10ffff] range and to use the surrogateescape error handler in 
that case.
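
The idea can be sketched in Python as follows. This is only an illustration, not the C code of the PR: decode_checked() and its wide_chars parameter are hypothetical names, wide_chars stands for the code points a lenient mbstowcs()-like decoder might return, and UTF-8 is assumed as the locale encoding.

```python
MAX_UNICODE = 0x10FFFF  # upper bound of the valid Python Unicode range


def decode_checked(raw, wide_chars):
    """Return a str for raw bytes, given the code points a lenient decoder produced.

    Hypothetical helper: wide_chars is a list of ints, one per wide character.
    """
    if all(0 <= cp <= MAX_UNICODE for cp in wide_chars):
        # Every wide character is in range: use the decoder's result as is.
        return "".join(map(chr, wide_chars))
    # At least one character is out of range: decode the original bytes
    # again, escaping undecodable bytes as lone surrogates (PEP 383).
    return raw.decode("utf-8", errors="surrogateescape")


# Hand-built example: a lenient decoder turns F4 90 80 80 into 0x110000.
raw = b"\xf4\x90\x80\x80"
text = decode_checked(raw, [0x110000])
print(text.encode("utf-8", errors="surrogateescape") == raw)
```

With this fallback, the original bytes remain recoverable from the decoded string.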

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue35883>
_______________________________________