Re: [Python-Dev] PEP 393 Summer of Code Project

Isaac Morland Thu, 25 Aug 2011 19:45:06 -0700

On Thu, 25 Aug 2011, Guido van Rossum wrote:

> I'm not sure what should happen with UTF-8 when it (in flagrant
> violation of the standard, I presume) contains two separately-encoded
> surrogates forming a valid surrogate pair; probably whatever the UTF-8
> codec does on a wide build today should be good enough. Similarly for
> encoding to UTF-8 on a wide build if one managed to create a string
> containing a surrogate pair. Basically, I'm for a
> garbage-in-garbage-out approach (with separate library functions to
> detect garbage if the app is worried about it).

If it's called UTF-8, there is no decision to be taken as to decoder
behaviour - any byte sequence not permitted by the Unicode standard must
result in an error (although, of course, *how* the error is to be reported
could legitimately be the subject of endless discussion). There are
security implications to violating the standard so this isn't just
legalistic purity.


Hmmm, doesn't look good:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> '\xed\xb0\x80'.decode('utf-8')
u'\udc00'

Incorrect! Although this is a narrow build - I can't say what the wide
build would do.
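
For comparison, a minimal sketch of what a conforming decoder does with the
same bytes, assuming a Python 3 interpreter (not the 2.6.1 build shown
above). There the strict 'utf-8' codec rejects an encoded surrogate, and the
lenient behaviour is only available by explicitly asking for the
'surrogatepass' error handler:

```python
# Assumes Python 3; the variable names are illustrative only.
data = b'\xed\xb0\x80'  # UTF-8-style encoding of the lone surrogate U+DC00

# A strict UTF-8 decoder must refuse this byte sequence.
try:
    data.decode('utf-8')
    strict_result = 'accepted'
except UnicodeDecodeError:
    strict_result = 'rejected'

# Opting in to 'surrogatepass' reproduces the garbage-in-garbage-out result.
lenient = data.decode('utf-8', 'surrogatepass')

print(strict_result)
print(ascii(lenient))
```

The point of the sketch is only that the lenient path has to be requested
by name, so the default decoder can stay within the standard.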

For reasons of practicality, it may be appropriate to provide easy access
to a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must
not be called UTF-8. Other variations may also find use if provided.
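
As a rough illustration of what such a separately named decoder could look
like, here is a sketch in Python 3. The function name decode_cesu8 is
hypothetical, not an existing codec, and the approach (decode with
'surrogatepass', then round-trip through UTF-16 to join surrogate pairs) is
one possible shortcut rather than a validated implementation. In particular
it also accepts ordinary four-byte UTF-8 sequences, which a strict CESU-8
decoder would reject.

```python
def decode_cesu8(data: bytes) -> str:
    """Hypothetical lenient CESU-8 decoder (a sketch, not a real codec)."""
    # Step 1: let each three-byte encoded surrogate through as a lone
    # surrogate code point instead of raising an error.
    with_surrogates = data.decode('utf-8', 'surrogatepass')
    # Step 2: write the code points out as UTF-16 code units, then decode
    # strictly, which joins valid high/low pairs into one code point and
    # raises UnicodeDecodeError for any unpaired surrogate.
    utf16 = with_surrogates.encode('utf-16-le', 'surrogatepass')
    return utf16.decode('utf-16-le')


# U+1F600 encoded as CESU-8: surrogates D83D and DE00, three bytes each.
sample = b'\xed\xa0\xbd\xed\xb8\x80'
decoded = decode_cesu8(sample)
print(ascii(decoded))
```

Keeping this under its own name means callers who need CESU-8 input can ask
for it explicitly, while the decoder called UTF-8 remains strict.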


See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt

And CESU-8 technical report: http://www.unicode.org/reports/tr26/

Isaac Morland                   CSCF Web Guru
DC 2554C, x36650                WWW Software Specialist