[issue7045] utf-8 encoding error

2009-10-27 Thread Benjamin Peterson
Benjamin Peterson added the comment: This is a duplicate of #3297, and Adam's patch there fixes it. -- nosy: +benjamin.peterson resolution: -> duplicate status: open -> closed superseder: -> Python interpreter uses Unicode surrogate pairs only before the pyc is created

[issue7045] utf-8 encoding error

2009-10-03 Thread Arc Riley
Arc Riley added the comment: This behavior is identical whether u.py or u.pyc is run on my systems, whereas that previous ticket concerns differing behavior, though the two are obviously related. -- versions: -Python 2.6, Python 3.0

[issue7045] utf-8 encoding error

2009-10-03 Thread Adam Olsen
Adam Olsen added the comment: I believe this is a duplicate of issue #3297. When given a high unicode scalar value directly in the source (rather than in escaped form) python will split it into surrogates, even on a UTF-32 build where those surrogates are nonsensical and ill-formed. Patches fo
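For readers unfamiliar with the split Adam describes, the arithmetic behind a surrogate pair can be sketched as follows. This is only an illustration of the standard UTF-16 calculation, not code from the patches on #3297; the helper names `to_surrogates` and `from_surrogates` are hypothetical.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a well-formed surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+10451 is the character reported in this issue.
high, low = to_surrogates(0x10451)
print(hex(high), hex(low))
print(hex(from_surrogates(high, low)))
```

The pair computed for U+10451 is the same `\ud801` / `\udc51` that appears in the sessions quoted elsewhere in this thread. Each half is meaningless on its own, which is why a lone `\ud801` is reported as an illegal character.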

[issue7045] utf-8 encoding error

2009-10-03 Thread Arc Riley
Arc Riley added the comment: Amaury, you are absolutely correct, \ud801 is not a valid unicode glyph, however I am not giving Python \ud801, I am giving Python '𐑑' (== '\U00010451'). I am attaching a different short example that demonstrates that Python is mishandling UTF-8 on both the interact

[issue7045] utf-8 encoding error

2009-10-03 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc added the comment: The page: http://www.fileformat.info/info/unicode/char/d801/index.htm has a big warning saying that "U+D801 is not a valid unicode character." The problem is similar to issue6697, and leads to the same question: should python validate utf-8 input, and refu
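As a rough illustration of the validation question, the sketch below shows how a current CPython 3 release treats a lone surrogate in its strict UTF-8 codec. This is an assumption about present-day behavior and says nothing definite about the 3.1-era interpreter discussed in this thread; the variable names are illustrative.

```python
lone = "\ud801"  # high surrogate with no matching low surrogate

# Strict UTF-8 encoding refuses a lone surrogate.
try:
    lone.encode("utf-8")
    encode_rejected = False
except UnicodeEncodeError:
    encode_rejected = True

# The three bytes that would spell U+D801 in UTF-8 style are refused on decode.
try:
    b"\xed\xa0\x81".decode("utf-8")
    decode_rejected = False
except UnicodeDecodeError:
    decode_rejected = True

# The full character U+10451 round-trips as four UTF-8 bytes.
encoded = "\U00010451".encode("utf-8")
print(encode_rejected, decode_rejected, encoded)
```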

[issue7045] utf-8 encoding error

2009-10-03 Thread Arc Riley
Arc Riley added the comment:
Python 3.1.1 (r311:74480, Sep 13 2009, 22:19:17)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> u = '𐑑'
>>> print(u)
Traceback (most recent call last): File "", line 1, in

[issue7045] utf-8 encoding error

2009-10-03 Thread Ezio Melotti
Ezio Melotti added the comment: I can't reproduce it either on Ubuntu 9.04 32-bit. I tried both from the terminal and from the file, using Py3.2a0. As Martin said, the fact that in narrow builds of Python the codepoints outside the BMP are represented with a surrogate pair (two code units) is a known "issue"

[issue7045] utf-8 encoding error

2009-10-03 Thread Martin v. Löwis
Martin v. Löwis added the comment: I can't reproduce that; it prints fine for me. Notice that it is perfectly fine for Python to represent this as two code points in UCS-2 mode (so that len(s)==2); this is called UTF-16. -- nosy: +loewis
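The two-unit representation Martin mentions can be observed by encoding the character to UTF-16 and counting 16-bit units. This is a sketch that assumes Python 3.3 or later, where `len()` counts code points regardless of build; the helper name `utf16_code_units` is hypothetical.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


s = "\U00010451"
units = utf16_code_units(s)
print(len(s), [hex(u) for u in units])
```

On the narrow (UCS-2) builds discussed in this thread, `len(s)` was 2 because the string was stored as those two code units.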

[issue7045] utf-8 encoding error

2009-10-02 Thread Arc Riley
Arc Riley added the comment: While t.py only fails on 3.1, the following happens with 3.0 as well:
>>> line = '𐑑𐑧𐑕𐑑𐑦𐑙'
>>> first = '𐑑'
>>> first
'𐑑'
>>> line[0]
'\ud801'
>>> line[0] == first
False
And with 2.6:
>>> line = u'𐑑𐑧𐑕𐑑𐑦𐑙'
>>> first = u'𐑑'
>>> first
u'\ud801\udc51'
-- versions

[issue7045] utf-8 encoding error

2009-10-02 Thread Arc Riley
New submission from Arc Riley: The following is a minimal example which does not work under Python 3.1.1 but functions as expected on Python 2.6 and 3.0. Python 3.1.1 believes the single UTF-8 glyph is two entirely different (and illegal) unicode characters: Traceback (most recent call last):