On Thu, 25 Aug 2011, Guido van Rossum wrote:

> I'm not sure what should happen with UTF-8 when it (in flagrant
> violation of the standard, I presume) contains two separately-encoded
> surrogates forming a valid surrogate pair; probably whatever the UTF-8
> codec does on a wide build today should be good enough. Similarly for
> encoding to UTF-8 on a wide build if one managed to create a string
> containing a surrogate pair. Basically, I'm for a
> garbage-in-garbage-out approach (with separate library functions to
> detect garbage if the app is worried about it).

If it's called UTF-8, there is no decision to be taken as to decoder behaviour - any byte sequence not permitted by the Unicode standard must result in an error (although, of course, *how* the error is to be reported could legitimately be the subject of endless discussion). There are security implications to violating the standard, so this isn't just legalistic purity.

Hmmm, doesn't look good:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xed\xb0\x80'.decode('utf-8')
u'\udc00'


Incorrect! Although this is a narrow build - I can't say what the wide build would do.
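
For comparison, here is a minimal sketch of the behaviour the standard asks for. It assumes a Python 3 interpreter, where the default strict UTF-8 codec rejects encoded surrogates; the helper name strict_utf8_decode is made up for illustration and is not part of any library. The lenient result is only available when the caller asks for it by name through the 'surrogatepass' error handler.

```python
# Byte sequence from the session above: a UTF-8-style encoding of U+DC00,
# a lone low surrogate, which RFC 3629 does not permit.
data = b'\xed\xb0\x80'


def strict_utf8_decode(raw):
    """Hypothetical helper: return (text, None) on success, (None, message) on error."""
    try:
        return raw.decode('utf-8'), None
    except UnicodeDecodeError as exc:
        return None, str(exc)


# Assumes Python 3: the strict decoder reports an error for this input.
text, error = strict_utf8_decode(data)

# The garbage-in-garbage-out result has to be requested explicitly.
lenient = data.decode('utf-8', errors='surrogatepass')
```

How the error is reported (exception, replacement character, callback) remains open to discussion; the point of the sketch is only that the default path does not silently yield u'\udc00'.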

For reasons of practicality, it may be appropriate to provide easy access to a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be called UTF-8. Other variations may also find use if provided.
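
A separately named CESU-8 decoder might look something like the sketch below. This is an illustration under assumptions, not a proposal for the actual codec: it assumes Python 3, the function name decode_cesu8 is hypothetical, and it leans on the existing 'surrogatepass' error handler plus a round trip through UTF-16 to join surrogate pairs rather than parsing the bytes directly.

```python
def decode_cesu8(raw):
    """Hypothetical CESU-8 decoder sketch (not a standard library codec).

    CESU-8 encodes each supplementary character as two 3-byte sequences,
    one per UTF-16 surrogate, and never uses 4-byte sequences.
    """
    # Lead bytes 0xF0 and above start 4-byte sequences (or are invalid),
    # none of which are allowed in CESU-8.
    if any(byte >= 0xF0 for byte in raw):
        raise ValueError("4-byte sequences are not valid CESU-8")

    # Step 1: decode, letting individually encoded surrogates through.
    halves = raw.decode('utf-8', errors='surrogatepass')

    # Step 2: round-trip through UTF-16 so that each high/low surrogate
    # pair is joined into one code point. The strict UTF-16 decode raises
    # UnicodeDecodeError if a surrogate is left unpaired.
    utf16 = halves.encode('utf-16-le', errors='surrogatepass')
    return utf16.decode('utf-16-le')


# U+10400 as a CESU-8 surrogate pair: ED A0 81 (U+D801) + ED B0 80 (U+DC00).
joined = decode_cesu8(b'\xed\xa0\x81\xed\xb0\x80')
```

Keeping this under its own name means callers who need the lenient pairing opt in explicitly, while anything called UTF-8 stays strict.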

See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt

And CESU-8 technical report: http://www.unicode.org/reports/tr26/

Isaac Morland                   CSCF Web Guru
DC 2554C, x36650                WWW Software Specialist
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev