[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

Nick Coghlan Sun, 12 Feb 2012 03:18:56 -0800

Nick Coghlan <[email protected]> added the comment:

If such use cases are indeed better handled as bytes, then that's what should 
be documented. However, there are some text processing assumptions that no 
longer hold when using bytes instead of strings (such as "x[0:1] == x[0]"). You 
also can't safely pass such byte sequences to various other APIs (e.g. 
urllib.parse will happily process surrogate escaped text without corrupting 
it, but will throw UnicodeDecodeError for byte sequences that aren't pure 
7-bit ASCII).
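
A rough sketch of both points, for illustration only. The URL and the 
variable names are made up for the example, and the non-ASCII byte is 
deliberately placed in the path rather than the host name:

```python
from urllib.parse import urlparse

text = "abc"
data = b"abc"

# For str, a one-element slice and an index give the same value.
str_slice_matches_index = text[0:1] == text[0]      # "a" == "a"
# For bytes, slicing gives bytes but indexing gives an int.
bytes_slice_matches_index = data[0:1] == data[0]    # b"a" == 97

# Hypothetical URL whose path contains a byte that is not 7-bit ASCII.
raw_url = b"http://example.com/caf\xe9"

# Passing the bytes directly: urllib.parse decodes them as strict ASCII.
try:
    urlparse(raw_url)
    bytes_parse_failed = False
except UnicodeDecodeError:
    bytes_parse_failed = True

# Decoding with surrogateescape first: the odd byte is carried through
# as a lone surrogate and can be turned back into the original byte.
escaped_url = raw_url.decode("ascii", errors="surrogateescape")
parts = urlparse(escaped_url)
recovered_path = parts.path.encode("ascii", errors="surrogateescape")
```

Here the str comparison is true and the bytes comparison is false, the bytes 
URL is rejected, and recovered_path comes back as the original path bytes.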


Using surrogateescape instead means that you're only going to have problems if 
you try to encode the data to an encoding other than the source one. That's 
basically the way things work in Python 2 with 8-bit strings.
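
As a minimal sketch of that behaviour (the sample bytes and names are 
invented for the example, and ASCII stands in for the "source" encoding):

```python
# Hypothetical input: mostly ASCII, with one byte of unknown encoding.
raw_bytes = b"caf\xe9"

# Decode with surrogateescape: the undecodable byte becomes U+DCE9.
escaped_text = raw_bytes.decode("ascii", errors="surrogateescape")

# Encoding back with the same codec and handler restores the input.
round_tripped = escaped_text.encode("ascii", errors="surrogateescape")

# A plain (strict) encode to a different encoding rejects the lone surrogate.
try:
    escaped_text.encode("utf-8")
    reencode_failed = False
except UnicodeEncodeError:
    reencode_failed = True
```

The round trip is lossless, while the strict re-encode to UTF-8 fails, which 
is the only point at which the escaped data causes trouble.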

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue13997>
_______________________________________