New submission from Ezio Melotti <[EMAIL PROTECTED]>:

Problem: when Unicode characters with a code point greater than U+FFFF are written directly in the source file (that is, as normal characters in a u'' string rather than in the form u'\Uxxxxxxxx'), the interpreter represents them as surrogate pairs only if the pyc doesn't exist. Once the pyc has been created, it uses a "normal" character (\Uxxxxxxxx instead of the pair \uxxxx\uxxxx). This can lead to unexpected behavior when comparing Unicode strings or in other situations. It can be worked around in several ways (using u'\Uxxxxxxxx' or u'\uxxxx' escapes instead of the characters, or encoding the strings before comparing them), but there shouldn't be any difference between a py and its pyc.
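The mismatch described above can be illustrated with a minimal sketch. It is written in Python 3 syntax so that it runs on a current interpreter (the report itself concerns Python 2.4 and 2.5), and the names `as_pair`, `as_single` and `same_utf16` are made up for illustration. It only shows why the two representations compare unequal and how encoding before comparing hides the difference; it does not reproduce the py/pyc behavior itself.

```python
# Two spellings of U+10123, matching the two reprs in the report.
as_pair = '\ud800\udd23'      # surrogate pair: two code units
as_single = '\U00010123'      # a single code point


def same_utf16(a, b):
    """Compare two strings by their UTF-16 encoding (hypothetical helper).

    'surrogatepass' lets lone surrogates through the encoder, so a
    surrogate pair and the corresponding single code point yield the
    same bytes.
    """
    return (a.encode('utf-16-le', 'surrogatepass')
            == b.encode('utf-16-le', 'surrogatepass'))


print(len(as_pair), len(as_single))      # 2 1
print(as_pair == as_single)              # False
print(same_utf16(as_pair, as_single))    # True
```

This corresponds to the "encoding them before comparing" workaround: the strings differ as sequences of code points but are identical once encoded.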

Tested on:
Ubuntu 8.04 with python 2.4: Uses a surrogate pair.
Ubuntu 8.04 with python 2.5: Uses a surrogate pair.
Windows XP SP2 with python 2.4: Uses a "normal" character.

Steps to reproduce the problem:
1a. Download the attached file, or create it by following the next step.
1b. In a UTF-8-aware console write `print unichr(int('10123', 16))` (or any code point >= 0x10000) and copy the printed character (depending on the console it could be a box, two boxes or a character) into a file with the lines
    `# -*- coding: utf-8 -*-`
    `print 'Result:', u'<paste here the char>' == u'\U00010123'`
    `print 'Repr:', repr(u'<paste here the char>'), repr(u'\U00010123')`
    Save the file in UTF-8.
2. Open a python interpreter and import the file (`import unicodetest`). It should print `Result: False` and `Repr: u'\ud800\udd23' u'\U00010123'` (the character is represented as a surrogate pair). During this step the pyc file is created.
3. From the python interpreter write `reload(unicodetest)`. Now it should print `Result: True` and `Repr: u'\U00010123' u'\U00010123'` (the character is represented as a "normal" character). Any further reload will print True. If you delete the pyc and reload again it will print False.
(Instead of using reload() it is also possible to create a function and call it from the module when it's loaded and again with unicodetest.func(); the result will be the same.)

Expected behavior: The interpreter should use the same representation in both situations (and print True in both tests). Another solution could be to change the behavior of == so that it returns True when a normal character is compared with its surrogate pair (if that makes sense).

Further information: The character used for the test is part of the "Unicode Plane 1" (see http://en.wikipedia.org/wiki/Basic_Multilingual_Plane).
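For reference, the pair shown in step 2 follows from the standard UTF-16 arithmetic. The sketch below uses Python 3 syntax, and `to_surrogate_pair` is a hypothetical helper rather than part of the attached file. It computes the pair for the test character.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not in a supplementary plane')
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10123)
print('\\u%04x\\u%04x' % (high, low))    # \ud800\udd23
```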
More information about the surrogate pairs can be found here: http://en.wikipedia.org/wiki/Surrogate_pair#Encoding_of_characters_outside_the_BMP

----------
components: Unicode
files: unicodetest.py
messages: 69321
nosy: ezio.melotti, lemburg
severity: normal
status: open
title: Python interpreter uses Unicode surrogate pairs only before the pyc is created
type: behavior
versions: Python 2.4, Python 2.5
Added file: http://bugs.python.org/file10826/unicodetest.py

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3297>
_______________________________________