Nick Coghlan added the comment:
Note that the specific case I'm really interested in is printing on systems that
are properly configured to use UTF-8, but are getting bad metadata from an OS
API. I'm OK with the idea of *only* changing it for UTF-8 rather than for
arbitrary encodings, as well as restricting it to sys.stdout when the codec
used matches the default filesystem encoding.
To double check the current behaviour, I created a directory to tinker with
this. Filenames were created with the following:
>>> open("ℙƴ☂ℌøἤ".encode("utf-8"), "w")
>>> open("basic_ascii".encode("utf-8"), "w")
>>> b"\xd0\xd1\xd2\xd3".decode("latin-1")
'ÐÑÒÓ'
>>> open(b"\xd0\xd1\xd2\xd3", "w")
That last generates an invalid UTF-8 filename. "ls" actually degrades less
gracefully than I thought, and just prints question marks for the bad file:
$ ls -l
total 0
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:04 ????
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 basic_ascii
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 ℙƴ☂ℌøἤ
Python 2 & 3 both work OK if you just print the directory listing directly,
since repr() happily displays the surrogate escaped string:
$ python -c "import os; print(os.listdir('.'))"
['basic_ascii', '\xd0\xd1\xd2\xd3',
'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4']
$ python3 -c "import os; print(os.listdir('.'))"
['basic_ascii', '\udcd0\udcd1\udcd2\udcd3', 'ℙƴ☂ℌøἤ']
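The '\udcd0\udcd1\udcd2\udcd3' entry is the surrogateescape (PEP 383) form of the undecodable bytes. A minimal sketch of that round trip, assuming UTF-8 as the filesystem encoding (the variable names are just for illustration):

```python
# The same undecodable bytes used for the bad filename above.
raw = b"\xd0\xd1\xd2\xd3"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")

print(ascii(name))
print(restored == raw)
```

This is why os.listdir() can hand back a str for a filename that isn't valid UTF-8 and still pass it to open() later without losing information.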
Where it falls down is when you try to print the strings directly in Python 3:
$ python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 0: surrogates not allowed
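The same failure can be reproduced without touching the filesystem, since it comes purely from the strict error handler on the encoding step. A small sketch (the names here are illustrative, not from any API):

```python
# A surrogate-escaped name, as os.listdir() would return it for the bad file.
name = "\udcd0\udcd1\udcd2\udcd3"

try:
    # "strict" is the default error handler, and it is what print() ends up
    # using on a stdout that hasn't been configured otherwise.
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    failed = True
    reason = exc.reason
    position = exc.start
else:
    failed = False
    reason = None
    position = None

print(failed, reason, position)
```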
Setting the IO encoding, by contrast, produces behaviour closer to that of the
native tools:
$ PYTHONIOENCODING=utf-8:surrogateescape python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
����
ℙƴ☂ℌøἤ
On the other hand, setting PYTHONIOENCODING as shown provides an environmental
workaround, and http://bugs.python.org/issue15216 will provide an improved
programmatic workaround (which tools like http://code.google.com/p/pyp/ could
use to configure surrogateescape by default).
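Until something along those lines is available, a tool can get a similar effect by wrapping the underlying binary stream itself. A rough sketch only, not what #15216 proposes: make_escaping_stream is a hypothetical helper, and a BytesIO stands in for sys.stdout.buffer so the result can be inspected:

```python
import io


def make_escaping_stream(binary_stream, encoding="utf-8"):
    """Wrap a binary stream so lone surrogates are written back as raw bytes.

    Hypothetical helper for illustration; not part of the standard library.
    """
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="surrogateescape",
        newline="\n",  # no newline translation, so the output is predictable
        line_buffering=True,
    )


# A BytesIO stands in for sys.stdout.buffer here.
buffer = io.BytesIO()
out = make_escaping_stream(buffer)

print("\udcd0\udcd1\udcd2\udcd3", file=out)
out.flush()

written = buffer.getvalue()
print(written)
```

In a real tool the argument would be sys.stdout.buffer, with the result assigned back to sys.stdout, which has its own buffering and ownership wrinkles.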
So perhaps pursuing #15216 further would be a better approach than selectively
changing the default behaviour, along with better documentation of ways to
handle the surrogate escape error when it arises?
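As one example of what such documentation might show, here's a sketch of making a name safe to display when the output stream can't be changed. displayable is a made-up helper name, and it assumes the name came from a UTF-8 filesystem:

```python
def displayable(name, encoding="utf-8"):
    """Return a copy of name that is safe to print on a strict stream.

    Hypothetical helper: recovers the original bytes via surrogateescape,
    then decodes them again, turning undecodable bytes into U+FFFD.
    """
    raw = name.encode(encoding, "surrogateescape")
    return raw.decode(encoding, "replace")


for fname in ["basic_ascii", "\udcd0\udcd1\udcd2\udcd3", "ℙƴ☂ℌøἤ"]:
    print(ascii(displayable(fname)))
```

The result is for display only: the round trip is lossy, so the original str still has to be kept around for passing back to OS APIs.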
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue18713>
_______________________________________