Bugs item #1251300, was opened at 2005-08-03 21:49
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1251300&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Submitted By: nhaldimann (nhaldimann)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Decoding with unicode_internal segfaults on UCS-4 builds

Initial Comment:
On UCS-4 builds, decoding a byte string with the unicode_internal codec
doesn't work correctly for code points from 0x80000000 upwards and can
even segfault. I have observed the same behaviour on 2.5 from CVS and
2.4.0 on OS X/PowerPC as well as on 2.3.5 on Linux/x86. Here's an example:

Python 2.5a0 (#1, Aug 3 2005, 21:34:05)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1671)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> "\x7f\xff\xff\xff".decode("unicode_internal")
u'\U7fffffff'
>>> "\x80\x00\x00\x00".decode("unicode_internal")
u'\x00'
>>> "\x80\x00\x00\x01".decode("unicode_internal")
u'\x01'
>>> "\x81\x00\x00\x00".decode("unicode_internal")
Segmentation fault

On little endian architectures the byte strings must be reversed for the
same effect.

I'm not sure if I understand what's going on, but I guess there are 2
solution strategies:

1. Make unicode_internal work for any code point up to 0xFFFFFFFF.
2. Make unicode_internal raise a UnicodeDecodeError for anything above
0x10FFFF (== sys.maxunicode for UCS-4 builds).

It seems like there are no Unicode code points above 0x10FFFF, so the
latter solution feels more correct to me, even though it might break
backwards compatibility a tiny bit. The unicodeescape codec already does
a similar thing:

>>> u"\U00110000"
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-08-04 16:41

Message:
Logged In: YES
user_id=38388

I think solution 2 is the right approach, since Unicode code points only
go up to 0x10FFFF. Could you provide a patch?
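
For illustration, the range check proposed as solution 2 might look roughly like the sketch below. It is written in pure Python 3 rather than in the C code of the actual codec, and it is not the patch requested in this thread. The names decode_ucs4_checked and MAX_CODE_POINT are hypothetical, and the explicit byteorder argument stands in for the platform-dependent layout that unicode_internal uses.

```python
import sys

# Highest valid Unicode code point (sys.maxunicode on UCS-4 builds).
MAX_CODE_POINT = 0x10FFFF


def decode_ucs4_checked(data, byteorder=sys.byteorder):
    """Decode 4-byte code units, rejecting values above 0x10FFFF.

    Hypothetical helper: it only illustrates the range check and is
    not the real unicode_internal decoder.
    """
    remainder = len(data) % 4
    if remainder:
        # Input must consist of whole 4-byte units.
        raise UnicodeDecodeError(
            "unicode_internal", data,
            len(data) - remainder, len(data), "truncated input")

    chars = []
    for start in range(0, len(data), 4):
        # Interpret the unit as an unsigned 32-bit integer.
        value = int.from_bytes(data[start:start + 4], byteorder)
        if value > MAX_CODE_POINT:
            raise UnicodeDecodeError(
                "unicode_internal", data, start, start + 4,
                "illegal code point (> 0x10FFFF)")
        chars.append(chr(value))
    return "".join(chars)
```

With this check, decode_ucs4_checked(b"\x00\x00\x00\x41", "big") returns "A", while the big-endian input b"\x81\x00\x00\x00" from the report raises UnicodeDecodeError instead of being decoded.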