[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Marc-Andre Lemburg Wed, 07 Apr 2010 01:42:32 -0700

Marc-Andre Lemburg <m...@egenix.com> added the comment:

STINNER Victor wrote:
> 
> STINNER Victor <victor.stin...@haypocalc.com> added the comment:
> 
>> I also found out that, according to RFC 3629, surrogates 
>> are considered invalid and they can't be encoded/decoded, 
>> but the UTF-8 codec actually does it.
> 
> Python2 does, but Python3 raises an error.
> 
> Python 2.7a4+ (trunk:79675, Apr  3 2010, 16:11:36)
>>>> u"\uDC80".encode("utf8")
> '\xed\xb2\x80'
> 
> Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55)
>>>> "\uDC80".encode("utf8")
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 
> 0: surrogates not allowed
> 
> Refusing to encode surrogates (in utf8) causes a lot of crashes in Python3, 
> because most callers of _PyUnicode_AsString() assume that it never fails: 
> see #6687 (and #8195 and a lot of other crashes). It's not a good idea to 
> change it in Python 2.7, because it would require a huge amount of work and 
> we are close to the first beta of 2.7.


I wonder how that change got into the 3.x branch - I would certainly
not have approved it for the reasons given further up on this ticket.

I think we should revert that change for Python 3.2.
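
For reference, the behaviour difference quoted above can be reproduced on a
current Python 3 with a minimal sketch like the one below. It assumes the
'surrogatepass' error handler (available since Python 3.1) as the way to get
the Python 2 style bytes; the variable names are only illustrative and are
not taken from the ticket.

```python
# Lone low surrogate, the same code point used in the quoted sessions.
lone = "\udc80"

# Strict UTF-8 encoding rejects surrogates in Python 3.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' handler lets the surrogate through and yields the
# same three bytes that Python 2 produced.
passed = lone.encode("utf-8", "surrogatepass")

# Decoding with the same handler restores the original string.
roundtrip = passed.decode("utf-8", "surrogatepass")

print(strict_ok, passed, roundtrip == lone)
```

This only illustrates the two behaviours side by side; it says nothing about
which one the codec should use by default.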

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________