Marc-Andre Lemburg <m...@egenix.com> added the comment:

STINNER Victor wrote:
>
> STINNER Victor <victor.stin...@haypocalc.com> added the comment:
>
>> I also found out that, according to RFC 3629, surrogates
>> are considered invalid and they can't be encoded/decoded,
>> but the UTF-8 codec actually does it.
>
> Python2 does, but Python3 raises an error.
>
> Python 2.7a4+ (trunk:79675, Apr 3 2010, 16:11:36)
> >>> u"\uDC80".encode("utf8")
> '\xed\xb2\x80'
>
> Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55)
> >>> "\uDC80".encode("utf8")
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
> position 0: surrogates not allowed
>
> Deny encoding surrogates (in utf8) causes a lot of crashs in Python3, because
> most functions calling suppose that _PyUnicode_AsString() does never fail:
> see #6687 (and #8195 and a lot of other crashs). It's not a good idea to
> change it in Python 2.7, because it would require a huge work and we are
> close to the first beta of 2.7.
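
For reference, the Python 3 behaviour quoted above can be checked with a
short script. This is only a sketch: the helper name encode_utf8 is made up
for illustration, and it assumes an interpreter that provides the
"surrogatepass" error handler (Python 3.1 or later), which asks the codec to
emit the bytes that Python 2 produced by default.

```python
# Illustrative sketch only; encode_utf8 is a hypothetical helper,
# not something taken from the tracker issue.
LONE_SURROGATE = "\udc80"


def encode_utf8(text, errors="strict"):
    """Return the UTF-8 bytes for text, or None if the codec rejects it."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None


# Strict mode (the default) refuses the lone surrogate in Python 3.
strict_result = encode_utf8(LONE_SURROGATE)

# "surrogatepass" lets the surrogate through as a three-byte sequence.
passthrough_result = encode_utf8(LONE_SURROGATE, "surrogatepass")

print(strict_result)       # None
print(passthrough_result)  # b'\xed\xb2\x80'
```

The second result matches the bytes shown for Python 2.7 in the quoted
session, so the difference between the two branches comes down to the
default error handling of the codec.
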

I wonder how that change got into the 3.x branch - I would certainly
not have approved it for the reasons given further up on this ticket.

I think we should revert that change for Python 3.2.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________