Benjamin Peterson benja...@python.org added the comment:
This is a duplicate of #3297, and Adam's patch there fixes it.
--
nosy: +benjamin.peterson
resolution: -> duplicate
status: open -> closed
superseder: -> Python interpreter uses Unicode surrogate pairs only before the pyc is
Martin v. Löwis mar...@v.loewis.de added the comment:
I can't reproduce that; it prints fine for me.
Notice that it is perfectly fine for Python to represent this as two
code points in UCS-2 mode (so that len(s)==2); this is called UTF-16.
--
nosy: +loewis
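To illustrate the UTF-16 point above, here is a minimal sketch, assuming a modern CPython (3.3 or later, where the narrow/wide build distinction no longer exists). It shows the two 16-bit code units that UTF-16 uses for U+10451, the character discussed in this thread. The variable names are illustrative only.

```python
# Sketch: how U+10451 looks when encoded as UTF-16.
s = "\U00010451"

# Encode as big-endian UTF-16 without a BOM, then split into 16-bit units.
utf16 = s.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print(len(s))                   # number of code points in the str
print([hex(u) for u in units])  # the UTF-16 code units (a surrogate pair)
```

On a narrow build of that era the str stored these two code units directly, which is why len(s) could report 2 there.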
Ezio Melotti ezio.melo...@gmail.com added the comment:
I can't reproduce it either on Ubuntu 9.04 32-bit. I tried both from the
terminal and from the file, using Py3.2a0.
As Martin said, the fact that in narrow builds of Python the codepoints
outside the BMP are represented with two surrogate
Arc Riley arcri...@gmail.com added the comment:
Python 3.1.1 (r311:74480, Sep 13 2009, 22:19:17)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> u = '𐑑'
>>> print(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
The page:
http://www.fileformat.info/info/unicode/char/d801/index.htm
has a big warning saying that U+D801 is not a valid unicode character.
The problem is similar to issue6697, and leads to the same question:
should Python validate
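As a rough illustration of why U+D801 on its own is treated as invalid, the sketch below (assuming Python 3 and only the standard library) checks its general category and tries to encode it as UTF-8. The names lone, category and encodable are made up for this example.

```python
import unicodedata

lone = "\ud801"  # a lone high surrogate, not a Unicode scalar value

# General category of the code point; surrogates have their own category.
category = unicodedata.category(lone)

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(category, encodable)
```

Python 3 lets such a string exist in memory, so the rejection only shows up when the string is encoded, for example when it is printed to a UTF-8 terminal.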
Arc Riley arcri...@gmail.com added the comment:
Amaury, you are absolutely correct: \ud801 is not a valid Unicode glyph.
However, I am not giving Python \ud801, I am giving Python '𐑑' (==
'\U00010451').
I am attaching a different short example that demonstrates that Python
is mishandling UTF-8
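For reference, a small sketch of what a correct UTF-8 decode of this character looks like, assuming Python 3.3 or later. The byte string is the standard UTF-8 encoding of U+10451; it is not taken from the attached file, which is not shown here.

```python
raw = b"\xf0\x90\x91\x91"  # UTF-8 bytes for U+10451 (assumed input)

# Manual decode of a 4-byte sequence: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
b0, b1, b2, b3 = raw
manual = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

# The built-in codec should agree and yield a single code point.
decoded = raw.decode("utf-8")
print(hex(manual), hex(ord(decoded)), len(decoded))
```

The four bytes form one scalar value, so a decoder that reports two characters here has split it into surrogates somewhere along the way.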
Adam Olsen rha...@gmail.com added the comment:
I believe this is a duplicate of issue #3297. When given a high unicode
scalar value directly in the source (rather than in escaped form) python
will split it into surrogates, even on a UTF-32 build where those
surrogates are nonsensical and
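The split into surrogates described above follows the standard UTF-16 arithmetic. A minimal sketch, with hypothetical helper names to_surrogates and from_surrogates that are not part of any Python API:

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = to_surrogates(0x10451)
print([hex(u) for u in pair])
print(hex(from_surrogates(*pair)))
```

On a UTF-32 (wide) build each element of a str can already hold the full scalar value, so storing the two halves as separate characters loses the identity of the original character.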
Arc Riley arcri...@gmail.com added the comment:
This behavior is identical whether u.py or u.pyc is run on my systems,
where that previous ticket concerns differing behavior.
Though it is obviously related.
--
versions: -Python 2.6, Python 3.0
New submission from Arc Riley arcri...@gmail.com:
The following is a minimal example which does not work under Python
3.1.1 but functions as expected on Python 2.6 and 3.0.
Python 3.1.1 believes the single UTF-8 glyph is two entirely different
(and illegal) unicode characters:
Traceback (most
Arc Riley arcri...@gmail.com added the comment:
While t.py only bugs on 3.1, the following happens with 3.0 as well:
>>> line = '𐑑𐑧𐑕𐑑𐑦𐑙'
>>> first = '𐑑'
>>> first
'𐑑'
>>> line[0]
'\ud801'
>>> line[0] == first
False
And with 2.6:
>>> line = u'𐑑𐑧𐑕𐑑𐑦𐑙'
>>> first = u'𐑑'
>>> first
u'\ud801\udc51'
--
versions:
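For comparison, a sketch of the same indexing check written with escapes, assuming Python 3.3 or later, where a str is always indexed by code point. The contents of line beyond its first character are a stand-in chosen for this example.

```python
first = "\U00010451"
# Hypothetical line: only the leading U+10451 matters for the comparison.
line = first + "\U00010467\U00010455"

print(len(line))          # counts code points, not UTF-16 code units
print(line[0] == first)   # indexing returns the whole character
print(ascii(line[0]))
```

On such an interpreter line[0] is the full character rather than the high surrogate '\ud801' reported in the sessions above.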