On Thu, 25 Aug 2011, Guido van Rossum wrote:
I'm not sure what should happen with UTF-8 when it (in flagrant violation of the standard, I presume) contains two separately-encoded surrogates forming a valid surrogate pair; probably whatever the UTF-8 codec does on a wide build today should be good enough. Similarly for encoding to UTF-8 on a wide build if one managed to create a string containing a surrogate pair. Basically, I'm for a garbage-in-garbage-out approach (with separate library functions to detect garbage if the app is worried about it).
If it's called UTF-8, there is no decision to be taken as to decoder behaviour - any byte sequence not permitted by the Unicode standard must result in an error (although, of course, *how* the error is to be reported could legitimately be the subject of endless discussion). There are security implications to violating the standard so this isn't just legalistic purity.
Hmmm, doesn't look good: Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) [GCC 4.2.1 (Apple Inc. build 5646)] on darwin Type "help", "copyright", "credits" or "license" for more information.
'\xed\xb0\x80'.decode ('utf-8')
u'\udc00'
Incorrect! Although this is a narrow build - I can't say what the wide build would do.
For reasons of practicality, it may be appropriate to provide easy access to a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be called UTF-8. Other variations may also find use if provided.
See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt And CESU-8 technical report: http://www.unicode.org/reports/tr26/ Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com