[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

2010-06-18 Thread STINNER Victor
Changes by STINNER Victor victor.stin...@haypocalc.com: status: open -> closed

2010-06-14 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: We are too close to the final 2.7 release; it's too late to backport. As I wrote, this feature is not important and there are many workarounds, so we don't need to backport to 3.1. Close the issue: use Python 3.2 if you want a

2010-06-09 Thread Terry J. Reedy
Changes by Terry J. Reedy tjre...@udel.edu: versions: -Python 2.4, Python 2.5, Python 3.0

2010-05-21 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment: @benjamin.peterson: Do you plan to port r75928 to 2.7 and 3.1? If not, can you close this issue? I think this issue's priority is minor because few people write non-BMP characters directly in Python files (maybe only one, Ezio

2009-10-28 Thread Benjamin Peterson
Changes by Benjamin Peterson benja...@python.org: dependencies: +UnicodeEncodeError - I can't even see license

2009-10-28 Thread Benjamin Peterson
Benjamin Peterson benja...@python.org added the comment: Committed Adam's patch in r75928.

2009-10-04 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment: This last point is already tracked by issue5127. nosy: +amaury.forgeotdarc

2009-10-04 Thread Adam Olsen
Adam Olsen rha...@gmail.com added the comment: Patch, which uses UTF-32-BE as indicated in my last comment. Test included. keywords: +patch Added file: http://bugs.python.org/file15043/py3k-nonBMP-literal.diff

2009-10-04 Thread Adam Olsen
Adam Olsen rha...@gmail.com added the comment: With some further prodding I've noticed that although the test behaves as expected in the py3k branch (fails on UTF-32 builds before the patch), it doesn't fail using Python 3.0. I'm guessing there are interactions with compile() vs import and the

2009-10-03 Thread Adam Olsen
Adam Olsen rha...@gmail.com added the comment: Looks like the failure mode has changed here, presumably due to issue #3672 patches. It now always fails, even after loading from a .pyc. This is using py3k via bzr, which reports itself as 3.2a0 $ rm unicodetest.pyc $ ./python -c 'import

2009-10-03 Thread Adam Olsen
Adam Olsen rha...@gmail.com added the comment: I've traced down the biggest problem to decode_unicode in ast.c. It needs to convert everything into a form of escapes so it becomes pure ASCII, which is then evaluated back into a unicode object. Unfortunately, it uses UTF-16-BE to do so,
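
The escape-and-reevaluate idea described in this comment can be illustrated with a minimal Python 3 sketch. This is not the code in ast.c or in the attached patch; the helper name `escape_non_ascii` is hypothetical. It walks the text as UTF-32-BE, so every code point, including non-BMP ones, becomes a single `\UXXXXXXXX` escape rather than a pair of surrogate escapes.

```python
import struct

def escape_non_ascii(text):
    """Rewrite text as pure ASCII, one \\UXXXXXXXX escape per code point."""
    # Hypothetical helper for illustration; not the ast.c implementation.
    units = struct.iter_unpack(">I", text.encode("utf-32-be"))
    parts = []
    for (code_point,) in units:
        # Keep plain ASCII as-is, but escape the backslash itself so the
        # result decodes back to the original text unambiguously.
        if code_point < 0x80 and code_point != 0x5C:
            parts.append(chr(code_point))
        else:
            parts.append("\\U%08x" % code_point)
    return "".join(parts)

escaped = escape_non_ascii("x = \U00010123")
# Evaluate the escapes back into a str, as the compiler step would.
restored = escaped.encode("ascii").decode("unicode_escape")
print(escaped)
```

Because the escape is built per 32-bit unit, the round trip yields one code point for U+10123 regardless of how the text was stored.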

2009-04-28 Thread Lino Mastrodomenico
Changes by Lino Mastrodomenico l.mastrodomen...@gmail.com: nosy: +l.mastrodomenico

2009-04-25 Thread Jakub Wilk
Changes by Jakub Wilk uba...@users.sf.net: nosy: +jwilk

2008-12-21 Thread hippietrail
Changes by hippietrail hippytr...@gmail.com: nosy: +hippietrail

2008-09-02 Thread Adam Olsen
Adam Olsen [EMAIL PROTECTED] added the comment: Marc, I don't understand what you're saying. UTF-16's surrogates are not optional. Unicode 2.0 and later require them, and Python is supposed to support it. Likewise, UCS-4 originally allowed a much larger range of code points, but it no longer

2008-09-02 Thread Adam Olsen
Adam Olsen [EMAIL PROTECTED] added the comment: I've got another report open about the codecs not properly reporting errors relating to surrogates: issue 3672

2008-09-01 Thread Marc-Andre Lemburg
Marc-Andre Lemburg [EMAIL PROTECTED] added the comment: On 2008-08-29 23:33, Terry J. Reedy wrote: Terry J. Reedy [EMAIL PROTECTED] added the comment: Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16 vs. UTF-32) I recently read most of the Unicode 5 standard and as

2008-08-29 Thread Terry J. Reedy
Terry J. Reedy [EMAIL PROTECTED] added the comment: Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16 vs. UTF-32) I recently read most of the Unicode 5 standard and as near as I could tell it no longer uses the term UCS, if it ever did. Chapter 3 has only the following 3

2008-08-21 Thread Benjamin Peterson
Benjamin Peterson [EMAIL PROTECTED] added the comment: Ping. nosy: +benjamin.peterson

2008-08-04 Thread Antoine Pitrou
Changes by Antoine Pitrou [EMAIL PROTECTED]: priority: -> critical versions: +Python 2.6

2008-07-12 Thread Marc-Andre Lemburg
Marc-Andre Lemburg [EMAIL PROTECTED] added the comment: Adam, I do know what I'm talking about: I was the lead designer of the Unicode integration you find in Python and implemented most of it. What you see as repr() of a Unicode object is the result of applying a codec to the internal

2008-07-12 Thread Adam Olsen
Adam Olsen [EMAIL PROTECTED] added the comment: Marc, perhaps Unicode has refined their definitions since you last looked? Valid UTF-8 *cannot* contain surrogates[1]. If it does, you have CESU-8[2][3], not UTF-8. So there are two bugs: first, the UTF-8 codec should refuse to load surrogates.
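
For reference, the strict behaviour argued for here can be observed in current Python 3 releases. The sketch below is only an illustration under that assumption, not a statement about the 2.x codecs discussed in this thread: the default UTF-8 codec rejects a lone surrogate in both directions, and the `surrogatepass` error handler is needed to produce the CESU-8-like bytes.

```python
lone_surrogate = "\ud800"

# Strict UTF-8 encoding refuses the surrogate code point.
try:
    lone_surrogate.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Explicitly opting in yields the three-byte form that is not valid UTF-8.
raw = lone_surrogate.encode("utf-8", "surrogatepass")

# Strict UTF-8 decoding refuses those bytes as well.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_encode_ok, raw, strict_decode_ok)
```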

2008-07-12 Thread Adam Olsen
Adam Olsen [EMAIL PROTECTED] added the comment: Err, to clarify, the parse/compile/whatever stages are producing broken UTF-32 (surrogates are ill-formed there too), and that gets transformed into CESU-8 when the .pyc is saved.

2008-07-11 Thread Ezio Melotti
Ezio Melotti [EMAIL PROTECTED] added the comment: On my Linux box sys.maxunicode == 1114111 and len(u'\U00010123') == 1, so it should be a UTF-32 build. On Windows, sys.maxunicode == 65535 and len(u'\U00010123') == 2, so it should be a UTF-16 build. The problem then seems related to
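
The length of 2 on the narrow build follows from the UTF-16 surrogate arithmetic. A minimal Python 3 sketch of that calculation (the helper name `to_surrogate_pair` is hypothetical, not a Python API) for the code point used in this report:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

high, low = to_surrogate_pair(0x10123)
print(hex(high), hex(low))
```

For U+10123 this gives D800 and DD23, the same two code units that appear as u'\ud800\udd23' elsewhere in this thread.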

2008-07-11 Thread Adam Olsen
Adam Olsen [EMAIL PROTECTED] added the comment: Simpler way to reproduce this (on Linux):
$ rm unicodetest.pyc
$
$ python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: u'\ud800\udd23' u'\U00010123'
$
$ python -c 'import unicodetest'
Result: True
Len: 1 1
Repr: u'\U00010123'

2008-07-11 Thread Marc-Andre Lemburg
Marc-Andre Lemburg [EMAIL PROTECTED] added the comment: Just to clarify: Python can be built as UCS2 or UCS4 build (not UTF-16 vs. UTF-32). The conversions done from the literal escaped representation to the internal format are done using the unicode-escape and raw-unicode-escape codecs. PYC

2008-07-11 Thread Adam Olsen
Adam Olsen [EMAIL PROTECTED] added the comment: No, the configure options are wrong - we do use UTF-16 and UTF-32. Although modern UCS-4 has been restricted down to the range of UTF-32 (it used to be larger!), UCS-2 still doesn't support the supplementary planes (i.e. no surrogates). If it

2008-07-06 Thread Ezio Melotti
New submission from Ezio Melotti [EMAIL PROTECTED]: Problem: when you have Unicode characters with a code point greater than U+ written directly in the source file (that is, not in the form u'\U' but as normal chars in a u'' string) the interpreter uses surrogate pairs for