Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
So the discussion is now on 2 points:
1. Is the change backwards compatible? (at the code level, after
recompilation). My answer is yes, because all known case
transformations stay in the same plane: if you pass a char in the BMP,
they
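The same-plane claim above can be checked empirically from Python. This is only a minimal sketch, assuming a modern Python 3 where strings are indexed by code point. The helper names `plane` and `cross_plane_case_mappings` are hypothetical, and the check only looks at single-character results of `str.upper()` and `str.lower()`, since full case mapping may expand one character into several.

```python
import sys

def plane(cp):
    """Return the Unicode plane (0 = BMP) of an integer code point."""
    return cp >> 16

def cross_plane_case_mappings():
    """Collect (source, mapped) code point pairs whose case mapping leaves the source plane."""
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates have no case mappings
        ch = chr(cp)
        for mapped in (ch.upper(), ch.lower()):
            # Skip multi-character expansions such as 'ß'.upper() == 'SS'.
            if len(mapped) == 1 and plane(ord(mapped)) != plane(cp):
                found.append((cp, ord(mapped)))
    return found

# A non-BMP example: DESERET SMALL LETTER LONG I uppercases within plane 1.
print(hex(ord("\U00010428".upper())))
```

Whatever `cross_plane_case_mappings()` returns depends on the Unicode database bundled with the interpreter, so the result is best read as a check against that version rather than a guarantee.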
Marc-Andre Lemburg m...@egenix.com added the comment:
It's not as easy as that.
The functions for case conversion are used in a way that assumes they
never fail (and indeed, the existing functions cannot fail).
What we can do is change the input parameter to Py_UCS4, but not the
Py_UNICODE
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
that would cause lots of compiler
warnings and implicit truncation on UCS2 builds
Unfortunately, there is no such warning, or the initial problem we are trying
to solve would have been spotted by such a warning (unicode_repr() calls
Changes by Amaury Forgeot d'Arc amaur...@gmail.com:
Added file: http://bugs.python.org/file15058/unicodectype_ucs4_3.patch
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue5127
___
Marc-Andre Lemburg m...@egenix.com added the comment:
Adam Olsen wrote:
Adam Olsen rha...@gmail.com added the comment:
Surrogates aren't optional features of UTF-16, we really need to get
this fixed. That includes .isalpha().
We use UCS2 on narrow Python builds, not UTF-16.
We might
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
No, but changing the APIs from 16-bit integers to 32-bit integers
does require a recompile of all code using it.
Is it acceptable between 3.1 and 3.2 for example? ISTM that other
changes already require recompilation of extension
Marc-Andre Lemburg m...@egenix.com added the comment:
Amaury Forgeot d'Arc wrote:
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
No, but changing the APIs from 16-bit integers to 32-bit integers
does require a recompile of all code using it.
Is it acceptable between 3.1 and
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
we should make sure that it's not possible to load an extension
compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.
This is the case with this patch: today all these functions
(_PyUnicode_IsAlpha,
Ezio Melotti ezio.melo...@gmail.com added the comment:
We might keep the old public API for compatibility, but it should be
clearly marked as broken for non-BMP scalar values.
That has always been the case. UCS2 doesn't support surrogates.
However, we have been slowly moving into the
Marc-Andre Lemburg m...@egenix.com added the comment:
Amaury Forgeot d'Arc wrote:
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
we should make sure that it's not possible to load an extension
compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.
This is the
Marc-Andre Lemburg m...@egenix.com added the comment:
This is off-topic for the tracker item, but I'll reply anyway:
Ezio Melotti wrote:
Ezio Melotti ezio.melo...@gmail.com added the comment:
We might keep the old public API for compatibility, but it should be
clearly marked as broken
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
We'd need to expose the UCS4 APIs *in addition*
to those APIs and have the UCS2 APIs redirect to the UCS4 ones.
Why have two names for the same function? It's Python 3, after all.
Or is this no recompile feature so important (as long
Marc-Andre Lemburg m...@egenix.com added the comment:
Amaury Forgeot d'Arc wrote:
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
We'd need to expose the UCS4 APIs *in addition*
to those APIs and have the UCS2 APIs redirect to the UCS4 ones.
Why have two names for the same
Adam Olsen rha...@gmail.com added the comment:
On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg rep...@bugs.python.org wrote:
We use UCS2 on narrow Python builds, not UTF-16.
We might keep the old public API for compatibility, but it should be
clearly marked as broken for non-BMP scalar
Marc-Andre Lemburg m...@egenix.com added the comment:
Adam Olsen wrote:
Adam Olsen rha...@gmail.com added the comment:
On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg rep...@bugs.python.org
wrote:
We use UCS2 on narrow Python builds, not UTF-16.
We might keep the old public API for
Adam Olsen rha...@gmail.com added the comment:
On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg rep...@bugs.python.org wrote:
All this is just nitpicking, really. UCS2 is a character set,
UTF-16 an encoding.
UCS is a character set, for most purposes synonymous with the Unicode
character set.
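For readers following the UCS2 versus UTF-16 argument, the arithmetic behind a surrogate pair can be sketched in a few lines. This is an illustration only, not code from the patch; the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical.

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogate_pair(high, low):
    """Combine a UTF-16 surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x10000)
# Cross-check against the UTF-16 codec shipped with Python.
encoded = chr(0x10000).encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

A pure UCS2 consumer sees the two 16-bit units as two separate, unassigned values, which is why functions taking a single 16-bit argument cannot classify a non-BMP character.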
Adam Olsen rha...@gmail.com added the comment:
Surrogates aren't optional features of UTF-16, we really need to get
this fixed. That includes .isalpha().
We might keep the old public API for compatibility, but it should be
clearly marked as broken for non-BMP scalar values.
I don't see a
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
priority: - normal
stage: - patch review
Ezio Melotti ezio.melo...@gmail.com added the comment:
FWIW, on Python3 it seems to work:
>>> import unicodedata
>>> unicodedata.category("\U00010000")
'Lo'
>>> unicodedata.category("\U00011000")
'Cn'
>>> unicodedata.category(chr(0x10000))
'Lo'
>>> unicodedata.category(chr(0x11000))
'Cn'
>>> ord(chr(0x10000)),
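For comparison, on interpreters that store strings by code point (CPython 3.3 and later, which is an assumption about the reader's environment rather than something stated in this thread), the narrow/wide distinction no longer shows up in a session like the one above. A small sketch; the 'Cn' result for U+11000 reflects the Unicode database of the time, so it is deliberately not reproduced here.

```python
import unicodedata

ch = chr(0x10000)  # the first code point outside the BMP

print(len(ch))                   # number of code points in the string
print(hex(ord(ch)))              # round trip through ord()
print(unicodedata.category(ch))  # general category from the Unicode database
print(unicodedata.name(ch))      # character name from the Unicode database
```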
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
Since r56395, ord() and chr() accept and return surrogate pairs even in
narrow builds.
The goal is to remove most differences between narrow and wide unicode
builds (except for string lengths, indices or slices)
To address this
STINNER Victor victor.stin...@haypocalc.com added the comment:
amaury Since r56395, ord() and chr() accept and return surrogate pairs
amaury even in narrow builds.
Note: My examples are made with Python 2.x.
The goal is to remove most differences between narrow and wide unicode
builds
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
That would cause major breakage in the C API
Not if you recompile. I don't see how this breaks the API at the C level.
and is not in line with the intention of having a Py_UNICODE
type in the first place.
Py_UNICODE is still used
Marc-Andre Lemburg m...@egenix.com added the comment:
On 2009-02-03 13:39, Amaury Forgeot d'Arc wrote:
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
Since r56395, ord() and chr() accept and return surrogate pairs even in
narrow builds.
The goal is to remove most differences
STINNER Victor victor.stin...@haypocalc.com added the comment:
I don't understand the behaviour of unichr():
Python 2.7a0 (trunk:68963M, Jan 30 2009, 00:49:28)
>>> import unicodedata
>>> unicodedata.category(u"\U00010000")
'Lo'
>>> unicodedata.category(u"\U00011000")
'Cn'
Marc-Andre Lemburg m...@egenix.com added the comment:
On 2009-02-03 14:14, STINNER Victor wrote:
STINNER Victor victor.stin...@haypocalc.com added the comment:
amaury Since r56395, ord() and chr() accept and return surrogate pairs
amaury even in narrow builds.
Note: My examples are made
STINNER Victor victor.stin...@haypocalc.com added the comment:
lemburg This is not possible for unichr() in Python 2.x, since applications
lemburg always expect len(unichr(x)) == 1
Oh, ok.
lemburg Changing ord() would be possible in Python 2.x is easier, since
lemburg this would only extend
Marc-Andre Lemburg m...@egenix.com added the comment:
On 2009-02-03 14:50, Amaury Forgeot d'Arc wrote:
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
That would cause major breakage in the C API
Not if you recompile. I don't see how this breaks the API at the C level.
Well,
Ezio Melotti ezio.melo...@gmail.com added the comment:
haypo ord() of Python3 (narrow build) rejects surrogate characters:
haypo '\U00010000'
haypo len(chr(0x10000))
haypo 2
haypo ord(0x10000)
haypo TypeError: ord() expected string of length 1, but int found
ord() works fine on Py3, you
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
I must be missing some detail, but what does the Unicode database
have to do with the unicodeobject.c C API ?
Ah, now I understand your concerns. My suggestion is to change only the 20
functions in
unicodectype.c:
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
There were non-ascii characters in the Windows license file. This was
corrected with r67860.
I believe that chr(0x10000) and chr(0x11000) should have the
opposite behavior.
This other problem is because on a narrow unicode build,
New submission from Venusaur bup...@hotmail.com:
>>> license
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\site.py", line 372, in __repr__
    self.__setup()
  File "C:\Python30\lib\site.py", line 359, in __setup
    data = fp.read()
  File
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here (winxpsp2, Py3, cp850-terminal) the license works fine:
>>> license
Type license() to see the full license text
and license() works as well.
I get this output for the chr()s:
>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most