[issue5127] UnicodeEncodeError - I can't even see license

2009-10-06 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment: So the discussion is now on 2 points: 1. Is the change backwards compatible? (at the code level, after recompilation). My answer is yes, because all known case transformations stay in the same plane: if you pass a char in the BMP, they
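
The same-plane claim can be spot-checked from Python. This is a minimal sketch, assuming Python 3.3 or later (where each string element is a full code point). `plane` and `case_maps_leave_plane` are hypothetical helper names, not CPython functions. Note that `str.upper()` and `str.lower()` apply full case mappings, which is broader than the single-character functions discussed in this issue.

```python
def plane(cp):
    """Return the Unicode plane (0 = BMP) of an integer code point."""
    return cp >> 16


def case_maps_leave_plane(start, stop):
    """List characters in [start, stop) whose upper/lower mapping
    contains a code point from a different plane."""
    offenders = []
    for cp in range(start, stop):
        ch = chr(cp)
        for mapped in (ch.upper(), ch.lower()):
            if any(plane(ord(m)) != plane(cp) for m in mapped):
                offenders.append(ch)
                break
    return offenders


# Deseret lives in plane 1 and has case pairs, so it is a useful sample:
# the small letter U+10428 upper-cases to the capital letter U+10400.
deseret_small = "\U00010428"
deseret_upper = deseret_small.upper()
```

Calling `case_maps_leave_plane(0, 0x110000)` scans every code point; the result depends on the Unicode version bundled with the interpreter.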

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-06 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment: It's not as easy as that. The functions for case conversion are used in a way that assumes they never fail (and indeed, the existing functions cannot fail). What we can do is change the input parameter to Py_UCS4, but not the Py_UNICODE

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-06 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
> that would cause lots of compiler warnings and implicit truncation on UCS2 builds
Unfortunately, there is no such warning, or the initial problem we are trying to solve would have been spotted by such a warning (unicode_repr() calls
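
The silent truncation under discussion can be imitated in Python. This is a sketch only, assuming an unsigned 16-bit Py_UNICODE as on a narrow build. `truncate_to_ucs2` is a hypothetical helper that mimics the C conversion; it is not part of any API.

```python
import unicodedata


def truncate_to_ucs2(cp):
    """Mimic passing a 32-bit code point through an unsigned 16-bit
    parameter: only the low 16 bits survive."""
    return cp & 0xFFFF


original = 0x10400  # a plane-1 code point (Deseret capital letter)
truncated = truncate_to_ucs2(original)

# The truncated value is a valid but unrelated BMP character, so a
# property lookup on it answers the wrong question without any error.
print(hex(truncated), unicodedata.name(chr(truncated)))
```

BMP code points pass through unchanged, which is why the problem only shows up for supplementary characters.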

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-06 Thread Amaury Forgeot d'Arc
Changes by Amaury Forgeot d'Arc amaur...@gmail.com: Added file: http://bugs.python.org/file15058/unicodectype_ucs4_3.patch ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue5127 ___

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
Adam Olsen wrote:
> Adam Olsen rha...@gmail.com added the comment:
> Surrogates aren't optional features of UTF-16, we really need to get this fixed. That includes .isalpha().
We use UCS2 on narrow Python builds, not UTF-16.
> We might

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
> No, but changing the APIs from 16-bit integers to 32-bit integers does require a recompile of all code using it.
Is it acceptable between 3.1 and 3.2 for example? ISTM that other changes already require recompilation of extension

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>> No, but changing the APIs from 16-bit integers to 32-bit integers does require a recompile of all code using it.
> Is it acceptable between 3.1 and

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
> we should make sure that it's not possible to load an extension compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.
This is the case with this patch: today all these functions (_PyUnicode_IsAlpha,

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment:
>> We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values.
> That has always been the case. UCS2 doesn't support surrogates. However, we have been slowly moving into the

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>> we should make sure that it's not possible to load an extension compiled with 3.1 in 3.2 to prevent segfaults and buffer overruns.
> This is the

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
This is off-topic for the tracker item, but I'll reply anyway:
Ezio Melotti wrote:
> Ezio Melotti ezio.melo...@gmail.com added the comment:
>>> We might keep the old public API for compatibility, but it should be clearly marked as broken

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
> We'd need to expose the UCS4 APIs *in addition* to those APIs and have the UCS2 APIs redirect to the UCS4 ones.
Why have two names for the same function? it's Python 3, after all. Or is this no recompile feature so important (as long

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>> We'd need to expose the UCS4 APIs *in addition* to those APIs and have the UCS2 APIs redirect to the UCS4 ones.
> Why have two names for the same

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Adam Olsen
Adam Olsen rha...@gmail.com added the comment:
On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg rep...@bugs.python.org wrote:
> We use UCS2 on narrow Python builds, not UTF-16.
>> We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
Adam Olsen wrote:
> Adam Olsen rha...@gmail.com added the comment:
> On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg rep...@bugs.python.org wrote:
>> We use UCS2 on narrow Python builds, not UTF-16.
>>> We might keep the old public API for

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-05 Thread Adam Olsen
Adam Olsen rha...@gmail.com added the comment:
On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg rep...@bugs.python.org wrote:
> All this is just nitpicking, really. UCS2 is a character set, UTF-16 an encoding.
UCS is a character set, for most purposes synonymous with the Unicode character set.

[issue5127] UnicodeEncodeError - I can't even see license

2009-10-04 Thread Adam Olsen
Adam Olsen rha...@gmail.com added the comment: Surrogates aren't optional features of UTF-16, we really need to get this fixed. That includes .isalpha(). We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values. I don't see a
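
To illustrate what a surrogate-aware .isalpha() involves, here is a hedged sketch that joins surrogate pairs in a sequence of 16-bit code units before testing each code point. It assumes Python 3.3 or later so that `chr()` accepts supplementary code points directly. `code_points` and `isalpha_utf16` are hypothetical names, not the functions touched by any patch on this issue.

```python
def code_points(units):
    """Yield code points from a sequence of 16-bit code units,
    joining each high/low surrogate pair into one code point."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is.
            yield u
            i += 1


def isalpha_utf16(units):
    """True if the sequence is non-empty and every code point is alphabetic."""
    cps = list(code_points(units))
    return bool(cps) and all(chr(cp).isalpha() for cp in cps)
```

For example, `isalpha_utf16([0xD801, 0xDC00])` tests U+10400 as a single letter, whereas a check on each code unit separately would see two surrogates and fail.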

[issue5127] UnicodeEncodeError - I can't even see license

2009-09-24 Thread Ezio Melotti
Changes by Ezio Melotti ezio.melo...@gmail.com: priority: -> normal, stage: -> patch review

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment:
FWIW, on Python3 it seems to work:
>>> import unicodedata
>>> unicodedata.category("\U00010000")
'Lo'
>>> unicodedata.category("\U00011000")
'Cn'
>>> unicodedata.category(chr(0x10000))
'Lo'
>>> unicodedata.category(chr(0x11000))
'Cn'
>>> ord(chr(0x10000)),

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment: Since r56395, ord() and chr() accept and return surrogate pairs even in narrow builds. The goal is to remove most differences between narrow and wide unicode builds (except for string lengths, indices or slices). To address this
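
For reference, the surrogate-pair arithmetic behind this behaviour is fixed by UTF-16. Below is a minimal sketch of the conversion in both directions. `to_surrogate_pair` and `from_surrogate_pair` are hypothetical helper names used for illustration, not CPython internals.

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into its
    UTF-16 high and low surrogate code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high/low surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

On a narrow build, a chr() result for such a code point occupies two string elements, which is why string lengths, indices and slices are listed as the remaining differences.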

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment:
amaury> Since r56395, ord() and chr() accept and return surrogate pairs
amaury> even in narrow builds.
Note: My examples are made with Python 2.x.
> The goal is to remove most differences between narrow and wide unicode builds

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
> That would cause major breakage in the C API
Not if you recompile. I don't see how this breaks the API at the C level.
> and is not inline with the intention of having a Py_UNICODE type in the first place.
Py_UNICODE is still used

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
On 2009-02-03 13:39, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
> Since r56395, ord() and chr() accept and return surrogate pairs even in narrow builds. The goal is to remove most differences

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment:
I don't understand the behaviour of unichr():
Python 2.7a0 (trunk:68963M, Jan 30 2009, 00:49:28)
>>> import unicodedata
>>> unicodedata.category(u"\U00010000")
'Lo'
>>> unicodedata.category(u"\U00011000")
'Cn'

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
On 2009-02-03 14:14, STINNER Victor wrote:
> STINNER Victor victor.stin...@haypocalc.com added the comment:
> amaury> Since r56395, ord() and chr() accept and return surrogate pairs
> amaury> even in narrow builds.
> Note: My examples are made

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread STINNER Victor
STINNER Victor victor.stin...@haypocalc.com added the comment:
lemburg> This is not possible for unichr() in Python 2.x, since applications
lemburg> always expect len(unichr(x)) == 1
Oh, ok.
lemburg> Changing ord() would be possible in Python 2.x is easier, since
lemburg> this would only extend

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Marc-Andre Lemburg
Marc-Andre Lemburg m...@egenix.com added the comment:
On 2009-02-03 14:50, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
>> That would cause major breakage in the C API
> Not if you recompile. I don't see how this breaks the API at the C level.
Well,

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment:
haypo> ord() of Python3 (narrow build) rejects surrogate characters:
haypo> '\U00010000'
haypo> len(chr(0x10000))
haypo> 2
haypo> ord(0x10000)
haypo> TypeError: ord() expected string of length 1, but int found
ord() works fine on Py3, you

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-03 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
> I must be missing some detail, but what does the Unicode database have to do with the unicodeobject.c C API ?
Ah, now I understand your concerns. My suggestion is to change only the 20 functions in unicodectype.c:

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-02 Thread Amaury Forgeot d'Arc
Amaury Forgeot d'Arc amaur...@gmail.com added the comment: There were non-ascii characters in the Windows license file. This was corrected with r67860. I believe that chr(0x10000) and chr(0x11000) should have the opposite behavior. This other problem is because on a narrow unicode build,

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-01 Thread Venusaur
New submission from Venusaur bup...@hotmail.com:
>>> license
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\site.py", line 372, in __repr__
    self.__setup()
  File "C:\Python30\lib\site.py", line 359, in __setup
    data = fp.read()
  File

[issue5127] UnicodeEncodeError - I can't even see license

2009-02-01 Thread Ezio Melotti
Ezio Melotti ezio.melo...@gmail.com added the comment:
Here (winxpsp2, Py3, cp850-terminal) the license works fine:
>>> license
Type license() to see the full license text
and license() works as well. I get this output for the chr()s:
>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most