[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

Adam Olsen Sat, 03 Oct 2009 22:44:14 -0700

Adam Olsen <[email protected]> added the comment:

I've traced down the biggest problem to decode_unicode in ast.c.  It
needs to convert everything into a form of escapes so it becomes pure
ascii, which then become evaluated back into a unicode object. 
Unfortunately, it uses UTF-16-BE to do so, which always split
surrogates.  Switching it to UTF-32-BE is fairly straightforward, and
works even on UTF-16 (or "narrow") builds.


Incidentally, there's no point using the surrogatepass error handler
once we actually support surrogates.

Unfortunately there's a second problem in repr(). 
'\U0001010F'.isprintable() returns True on UTF-32 builds and False on
UTF-16 builds.  This causes repr() to escape it unnecessarily on UTF-16
builds.  repr() at least joins surrogate pairs before its internally
printable test (unlike .isprintable() or any other str method), but it
turns out all of the APIs in unicodectype.c only accept a single 16-bit
int in UTF-16 builds anyway.  That'll be a bigger patch than the first part.

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue3297>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

Reply via email to