Adam Olsen <rha...@gmail.com> added the comment: I've traced down the biggest problem to decode_unicode in ast.c. It needs to convert everything into a form of escapes so it becomes pure ascii, which then become evaluated back into a unicode object. Unfortunately, it uses UTF-16-BE to do so, which always split surrogates. Switching it to UTF-32-BE is fairly straightforward, and works even on UTF-16 (or "narrow") builds.
Incidentally, there's no point using the surrogatepass error handler once we actually support surrogates. Unfortunately there's a second problem in repr(). '\U0001010F'.isprintable() returns True on UTF-32 builds and False on UTF-16 builds. This causes repr() to escape it unnecessarily on UTF-16 builds. repr() at least joins surrogate pairs before its internally printable test (unlike .isprintable() or any other str method), but it turns out all of the APIs in unicodectype.c only accept a single 16-bit int in UTF-16 builds anyway. That'll be a bigger patch than the first part. ---------- _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue3297> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com