Nick Coghlan added the comment:

Note: I created issue 18814 to cover some additional tools for working with 
surrogate escaped strings.

For this issue, we currently have http://docs.python.org/3/howto/unicode.html, 
which aims to be a more comprehensive guide to understanding Unicode issues.

I'm thinking we may want a "Debugging Unicode Errors" document, which defers to 
the existing howto guide for those that really want to understand Unicode, and 
instead focuses on quick fixes for resolving various problems that may present 
themselves.

Application developers will likely want to read the longer guide, while the 
debugging document would be aimed at getting script writers past their 
immediate hurdle, without necessarily gaining a full understanding of Unicode.

The aim would be for this page to become the top hit for "python surrogates not 
allowed", rather than the current top hit, which is a rejected bug report about 
it (http://bugs.python.org/issue13717).

For example:

================================
What is the meaning of "UnicodeEncodeError: surrogates not allowed"?
--------------------------------------------------------------------

Operating system metadata on POSIX based systems like Linux and Mac OS X may 
include improperly encoded text values. To cope with this, Python uses the 
"surrogateescape" error handler to store those arbitrary bytes inside a Unicode 
object. When converted back to bytes using the same encoding and error handler, 
the original byte sequence is reproduced exactly. This allows operations like 
opening a file based on a directory listing to work correctly, even when the 
metadata is not properly encoded according to the system settings.
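
As a minimal sketch of that round trip (the file name and the invalid byte
below are made up for illustration, and UTF-8 is assumed as the system
encoding):

```python
# Bytes that are not valid UTF-8 (0xff can never appear in UTF-8 data).
# The file name is a made-up example.
raw = b"caf\xff.txt"

# Decoding with surrogateescape maps each undecodable byte to a lone
# surrogate code point in the range U+DC80..U+DCFF instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same codec and error handler restores the original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(ascii(text))
print(round_tripped == raw)
```

The escaped byte 0xff shows up in the string as the lone surrogate U+DCFF,
which is what later trips up the strict error handler.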

The "surrogates not allowed" error appears when a string from one of these 
operating system interfaces contains an embedded arbitrary byte sequence, but 
an attempt is made to encode it using the default "strict" error handler rather 
than the "surrogateescape" handler. This commonly occurs when printing 
improperly encoded operating system data to the console, or writing it to a 
file, database or other serialised interface.
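
The failure can be reproduced without touching the file system. This is only
a sketch: the string is built by hand here rather than coming from a real
operating system interface:

```python
# Simulate a file name returned by the operating system that contains
# a byte which is invalid in UTF-8 (hypothetical example data).
text = b"caf\xff.txt".decode("utf-8", errors="surrogateescape")

try:
    # The default error handler is "strict", which rejects lone surrogates.
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    reason = exc.reason
    print("Encoding failed:", reason)
```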

The ``PYTHONIOENCODING`` environment variable can be used to ensure operating 
system metadata can always be read via sys.stdin and written via sys.stdout. 
The following command will display the encoding Python will use by default to 
interact with the operating system::

    $ python3 -c "import sys; print(sys.getfilesystemencoding())"
    utf-8

This can then be used to specify an appropriate setting for
``PYTHONIOENCODING``::

    $ export PYTHONIOENCODING=utf-8:surrogateescape
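
One way to check that the setting took effect is to ask a child interpreter
which encoding and error handler it uses for ``sys.stdout``. The snippet
below is a sketch that sets the variable only for the child process, and it
assumes Python 3.7 or later for ``capture_output``:

```python
import os
import subprocess
import sys

# Copy the current environment and add the setting for the child only.
env = dict(os.environ, PYTHONIOENCODING="utf-8:surrogateescape")

# Ask a fresh interpreter to report how its standard output is configured.
result = subprocess.run(
    [sys.executable, "-c",
     "import sys; print(sys.stdout.encoding, sys.stdout.errors)"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
settings = result.stdout.split()
print(settings)
```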

For other interfaces, there is no such general solution. If allowing the 
invalid byte sequence to propagate further is acceptable, then enabling the 
surrogateescape handler may be appropriate. Alternatively, it may be better to 
track these corrupted strings back to their point of origin, and either fix the 
underlying metadata, or else filter them out early on.
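
Both approaches can be sketched briefly. The helper name
``contains_escaped_bytes`` and the sample file names are hypothetical, and an
in-memory buffer stands in for a real file:

```python
import io


def contains_escaped_bytes(text):
    """Return True if text holds bytes smuggled in by surrogateescape.

    The handler maps undecodable bytes 0x80..0xFF to U+DC80..U+DCFF.
    """
    return any("\udc80" <= ch <= "\udcff" for ch in text)


# Hypothetical directory listing: one clean name, one badly encoded name.
names = ["ok.txt", b"caf\xff.txt".decode("utf-8", "surrogateescape")]

# Option 1: let the original bytes propagate by enabling the handler
# on the output stream.
buffer = io.BytesIO()
stream = io.TextIOWrapper(
    buffer, encoding="utf-8", errors="surrogateescape", newline="\n"
)
stream.write("\n".join(names))
stream.flush()
written = buffer.getvalue()

# Option 2: filter the affected strings out early.
clean = [name for name in names if not contains_escaped_bytes(name)]

print(written)
print(clean)
```

Which option is appropriate depends on whether downstream consumers can cope
with data that is not valid in the declared encoding.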
================================

If issue 18814 is implemented, then it could point to those tools. Similarly, 
issue 15216 could be referenced if that is implemented.

----------
assignee:  -> docs@python
components: +Documentation
nosy: +docs@python
title: Enable surrogateescape on stdin and stdout when appropriate -> Clearly 
document the use of PYTHONIOENCODING to set surrogateescape

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18713>
_______________________________________