Re: [Python-Dev] UCS2/UCS4 default

2008-07-04 Thread Martin v. Löwis
>> The premise is the OP's idea that Python should switch to all UCS4 to >> create a more pure ('ideal') situation or the idea that len(s) should >> count codepoints (correct term?) for all builds as a matter of purity >> even though on it would be time-costly on 16-bit builds as a matter >> of pr

Re: [Python-Dev] UCS2/UCS4 default

2008-07-04 Thread Joe Smith
Martin v. Löwis writes: > > > Wrong term - code units and code points are equivalent in UTF-16 and > > UTF-32. What you're looking for is unicode scalar values. > > How so? Section 2.5, UTF-16 says > > "code points in the supplementary planes, in the range > U+10000..U+10FFFF, ar

Re: [Python-Dev] UCS2/UCS4 default

2008-07-04 Thread M.-A. Lemburg
On 2008-07-03 21:59, Steve Holden wrote: M.-A. Lemburg wrote: On 2008-07-03 19:44, Terry Reedy wrote: The premise of this thread seems to be that the majority should suffer for the benefit of a few. That is not Python's philosophy. In reality, most Unixes ship with UCS4 builds of Python. Win

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 4:50 PM, Adam Olsen <[EMAIL PROTECTED]> wrote: > Clearly, each surrogate is a valid code point, regardless of encoding. > A surrogate pair simultaneously represents both one code point (the > scalar value) and two code points (the surrogate code points). To be > unambiguous

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Adam Olsen
On Thu, Jul 3, 2008 at 4:21 PM, Guido van Rossum <[EMAIL PROTECTED]> wrote: > On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen <[EMAIL PROTECTED]> wrote: >> On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <[EMAIL PROTECTED]> wrote: >>> >>> The premise is the OP's idea that Python should switch to all UCS4 to

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> Wrong term - code units and code points are equivalent in UTF-16 and > UTF-32. What you're looking for is unicode scalar values. How so? Section 2.5, UTF-16 says "code points in the supplementary planes, in the range U+10000..U+10FFFF, are represented as pairs of 16-bit code units." So clearl

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 3:00 PM, Adam Olsen <[EMAIL PROTECTED]> wrote: > On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <[EMAIL PROTECTED]> wrote: >> >> The premise is the OP's idea that Python should switch to all UCS4 to create >> a more pure ('ideal') situation or the idea that len(s) should count >

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Adam Olsen
On Thu, Jul 3, 2008 at 3:01 PM, Terry Reedy <[EMAIL PROTECTED]> wrote: > > The premise is the OP's idea that Python should switch to all UCS4 to create > a more pure ('ideal') situation or the idea that len(s) should count > codepoints (correct term?) for all builds as a matter of purity even thoug

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Terry Reedy
Guido van Rossum wrote: On Thu, Jul 3, 2008 at 10:44 AM, Terry Reedy <[EMAIL PROTECTED]> wrote: The premise of this thread seems to be that the majority should suffer for the benefit of a few. That is not Python's philosophy. The premise is the OP's idea that Python should switch to all UCS

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Steve Holden
M.-A. Lemburg wrote: On 2008-07-03 19:44, Terry Reedy wrote: The premise of this thread seems to be that the majority should suffer for the benefit of a few. That is not Python's philosophy. In reality, most Unixes ship with UCS4 builds of Python. Windows and Mac OS X ship with UCS2 builds. S

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg
On 2008-07-03 19:44, Terry Reedy wrote: The premise of this thread seems to be that the majority should suffer for the benefit of a few. That is not Python's philosophy. In reality, most Unixes ship with UCS4 builds of Python. Windows and Mac OS X ship with UCS2 builds. Still, anyone is free t

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg
On 2008-07-03 19:35, Jeroen Ruigrok van der Werven wrote: -On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote: On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: Please remember that lone surrogate pair code points are perfectly valid Unicode code points, neverthele

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg
On 2008-07-03 19:21, Adam Olsen wrote: On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote: -On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote: Unicode is full of combining code points - if you break such

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Adam Olsen
On Thu, Jul 3, 2008 at 11:35 AM, Jeroen Ruigrok van der Werven <[EMAIL PROTECTED]> wrote: > -On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote: >>On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: >>> Please remember that lone surrogate pair code points are perfectly >

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 19:31], "Martin v. Löwis" ([EMAIL PROTECTED]) wrote: >Yes, but it is two code units. Python's UTF-16 implementation operates >on code units, not code points. Thank you, that is the single most important piece of information I got about this entire thing because it does change the ent
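The code unit versus code point distinction raised here can be sketched in a few lines. This is only an illustration under stated assumptions: it targets a modern Python 3 interpreter (3.3 or later), where `len()` on a `str` counts code points, not the 2008 builds under discussion. The helper name `utf16_code_units` is hypothetical; it approximates what a UTF-16 build would report by counting the 16-bit units of the UTF-16 encoding.

```python
def utf16_code_units(s: str) -> int:
    """Count the 16-bit code units needed to store s as UTF-16 (hypothetical helper)."""
    # "utf-16-le" emits no byte-order mark, so every 2 bytes is exactly one code unit.
    return len(s.encode("utf-16-le")) // 2


s = "\U00012345"                   # a single code point outside the BMP
code_points = len(s)               # modern str: length in code points
code_units = utf16_code_units(s)   # length in UTF-16 code units (a surrogate pair)
print(code_points, code_units)
```

For BMP-only text the two counts agree, which is why the difference only shows up with supplementary-plane characters.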

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 10:44 AM, Terry Reedy <[EMAIL PROTECTED]> wrote: > The premise of this thread seems to be that the majority should suffer for > the benefit of a few. That is not Python's philosophy. Who are the many here? Who are the few? I'd venture that (at least for the foreseeable futu

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Terry Reedy
Daniel Arbuckle wrote: Regardless, as I said before, nothing justifies silently changing the meaning of a program based on an option that most users don't set for themselves and are not aware of. The premise of this thread seems to be that the majority should suffer for the benefit of a few

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Bill Janssen
> Surely it's desirable under all circumstances that > >len(u) == sum(1 for c in u) > > and that > >[c for c in u] == [u[i] for i in range(len(u))] > > How would that play under Jeroen's proposed change? Yes, but I think the argument is about what "c" is -- a character or a codepoint.

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> I think you want to use codePointCount() to count the Unicode code points. > length() returns Unicode code units. > > As http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html explains: > > In the J2SE API documentation, Unicode code point is used for character > values in the range b

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> Please remember that lone surrogate pair code points are perfectly > valid Unicode code points, nevertheless. Just as a lone combining > code point is valid on its own. Actually, I think they aren't (not any more than an invalid codepoint, or an unassigned codepoint). They are reserved for UTF-1
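One way to see how lone surrogates are treated in practice, as a hedged sketch: in modern Python 3 (an assumption; this is not the 2008 behaviour being debated) a `str` can hold a lone surrogate code point, but the strict UTF-8 codec refuses to encode it, while a complete supplementary-plane character encodes normally. The variable names are illustrative only.

```python
lone = "\ud800"            # a lone high surrogate code point
full = "\U00012345"        # a complete supplementary-plane code point

try:
    lone.encode("utf-8")   # strict error handling is the default
    lone_encodable = True
except UnicodeEncodeError:
    lone_encodable = False

full_bytes = full.encode("utf-8")   # encodes to a 4-byte UTF-8 sequence
print(lone_encodable, len(full_bytes))
```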

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 19:21], Adam Olsen ([EMAIL PROTECTED]) wrote: >On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: >> Please remember that lone surrogate pair code points are perfectly >> valid Unicode code points, nevertheless. Just as a lone combining >> code point is valid on

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> 1. System is NOT memory limited (i.e. most desktops): use a UCS-4 Python > build, which is what most Linux distributions do (I'm not sure about the > pydotorg provided Windows or Mac OS X builds). The Windows builds must continue to use a two-byte representation, as otherwise PythonWin will brea

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Martin v. Löwis
> Basically everything but string forming or string printing seems to be > broken for surrogate pairs, from what I can tell. We probably disagree what "it works correctly" means. I think everything works correctly. > Also, I think you are confused about slicing in the middle of a surrogate > pair

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Adam Olsen
On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: > On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote: >> >> -On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote: >>> >>> Unicode is full of combining code points - if you break such a sequence, >>> the output w

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 10:01 AM, Jeroen Ruigrok van der Werven <[EMAIL PROTECTED]> wrote: > What would the chances for inclusion in Python be if such a PEP + code would > be presented Guido? As long as it is clear that the len() function and the basic slicing and indexing operations on strings con

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread James Y Knight
On Jul 3, 2008, at 10:46 AM, Jeroen Ruigrok van der Werven wrote: -On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote: You seem to be suggesting that len(u"\U00012345") should return 1 on a system that internally uses UTF-16 and hence represents this string as a surrogate pair.

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 18:45], James Y Knight ([EMAIL PROTECTED]) wrote: >I think this is misguided. Only trying to at least correct the current situation, which I consider a bit of a mess, personally. (Although it seems others share my view.) >I'd like to have 3 levels of access available: >1) "byte"-lev

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 17:03], Guido van Rossum ([EMAIL PROTECTED]) wrote: >I don't see an answer there to the question of whether the length() >method of a Java String object containing a single surrogate pair >returns 1 or 2; I suspect it returns 2. As http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Ch

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 9:35 AM, Steve Holden <[EMAIL PROTECTED]> wrote: > Paul Moore wrote: >> >> On 03/07/2008, Guido van Rossum <[EMAIL PROTECTED]> wrote: >>> >>> I don't see an answer there to the question of whether the length() >>> method of a Java String object containing a single surrogate p

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 17:32], Paul Moore ([EMAIL PROTECTED]) wrote: >System.out.println(s.length()); I think you want to use codePointCount() to count the Unicode code points. length() returns Unicode code units. As http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html explains: In th

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Steve Holden
Paul Moore wrote: On 03/07/2008, Guido van Rossum <[EMAIL PROTECTED]> wrote: I don't see an answer there to the question of whether the length() method of a Java String object containing a single surrogate pair returns 1 or 2; I suspect it returns 2. It appears you're right: type testucs.jav

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Armin Ronacher
Guido van Rossum writes: > The one thing that may be missing from Python is things like > interpretation of surrogates by functions like isalpha() and I'm okay > with adding that (since those have to loop over the entire string > anyway). That and methods to safely iterate and slice s
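A rough sketch of what surrogate-aware iteration over UTF-16 code units could look like. This is not an API from the thread or from Python itself: the helper names `utf16_units` and `iter_code_points` are hypothetical, and the code assumes a modern Python 3 interpreter, using the UTF-16 encoding of a `str` to stand in for the storage of a UTF-16 build.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units.

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point; anything else, including a lone surrogate,
    is yielded unchanged.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


units = utf16_units("a\U00012345b")
points = list(iter_code_points(units))
print([hex(u) for u in units], [hex(p) for p in points])
```

Because the iterator has to walk the sequence from the start, it is O(n), which matches the point made above that such functions loop over the entire string anyway.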

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Paul Moore
On 03/07/2008, Guido van Rossum <[EMAIL PROTECTED]> wrote: > I don't see an answer there to the question of whether the length() > method of a Java String object containing a single surrogate pair > returns 1 or 2; I suspect it returns 2. It appears you're right: >type testucs.java class testucs

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Amaury Forgeot d'Arc
Hello, 2008/7/3 Guido van Rossum <[EMAIL PROTECTED]>: > I don't see an answer there to the question of whether the length() > method of a Java String object containing a single surrogate pair > returns 1 or 2; I suspect it returns 2. Python 3 supports things like > chr(0x12345) and ord("\U00012345

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 7:46 AM, Jeroen Ruigrok van der Werven <[EMAIL PROTECTED]> wrote: > -On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote: >>You seem to be suggesting that len(u"\U00012345") should return 1 on >>a system that internally uses UTF-16 and hence represents this strin

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 15:58], Guido van Rossum ([EMAIL PROTECTED]) wrote: >You seem to be suggesting that len(u"\U00012345") should return 1 on >a system that internally uses UTF-16 and hence represents this string >as a surrogate pair. From a Unicode and UTF-16 point of view that makes the most sense. S

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Daniel Arbuckle
On Thu, Jul 3, 2008 at 6:42 AM, Mark Hammond <[EMAIL PROTECTED]> wrote: > For people on Windows, win32 isn't a "compatibility" consideration. I > suspect most users of the other platforms MAL mentioned and all others with > their own native unicode implementations would agree. I'm sorry, but you'

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Guido van Rossum
On Thu, Jul 3, 2008 at 3:48 AM, Jeroen Ruigrok van der Werven <[EMAIL PROTECTED]> wrote: > My apologies for hammering on this, but I think it is quite important and > currently Python 3.0 seems confused about UCS-2 versus UTF-16. [...] You seem to be suggesting that len(u"\U00012345") should retu

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg
On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote: -On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote: Unicode is full of combining code points - if you break such a sequence, the output will be just as wrong; regardless of UCS2 vs. UCS4. In my opinion you are confusing two r

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Mark Hammond
> For programmers who want to target a 2-byte format (for win32 > compatibility, for example) As MAL said, this is taking the discussion in the wrong direction. For people on Windows, win32 isn't a "compatibility" consideration. I suspect most users of the other platforms MAL mentioned and all o

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
-On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote: >Unicode is full of combining code points - if you break such a sequence, >the output will be just as wrong; regardless of UCS2 vs. UCS4. In my opinion you are confusing two related, but very separated things here. Combining characters
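The combining-sequence side of this exchange can be shown with a small sketch using the standard `unicodedata` module. It assumes a modern Python 3 interpreter; the variable names are illustrative. It shows that splitting a combining sequence changes the text regardless of how code points are stored, which is a separate matter from splitting a surrogate pair.

```python
import unicodedata

decomposed = "e\u0301"                                # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)   # the single precomposed code point

# Both forms render as the same character, yet their lengths differ.
lengths = (len(decomposed), len(composed))

# Slicing inside the combining sequence silently drops the accent.
broken = decomposed[:1]
print(lengths, broken)
```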

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Daniel Arbuckle
On Thu, Jul 3, 2008 at 5:39 AM, Nick Coghlan <[EMAIL PROTECTED]> wrote: > 1. If you are advocating disallowing the use of characters outside the BMP > in a UCS-2 build, enumerate the advantages of doing so (paying particular > attention to any advantages which cannot be obtained simply by using an

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread M.-A. Lemburg
I think the discussion is going in the wrong direction: The choice between UCS2 and UCS4 builds is really only meant to enhance the possibility to interface to native OS or application APIs, e.g. Windows LIBC and Java use UTF-16, glibc on Unix uses UCS4. The problem of slicing Unicode objects is

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Nick Coghlan
Jeroen Ruigrok van der Werven wrote: The documentation for len() says: Return the length (the number of items) of an object. So what this tells us is that in a UCS-2 build of Python, the "items" in a unicode string are not, strictly speaking, Unicode code points or characters. Instead, they a

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Antoine Pitrou
Hi, > Subsequently doing a: print a[1] to get the 0x942a (鐪) actually requires > a[2] on the 2-byte Python 3.0. How is it annoying *in practice*? In actual code the index, instead of being a constant, will be retrieved through various means such as .find() or re.search().start()... as you show y

Re: [Python-Dev] UCS2/UCS4 default

2008-07-03 Thread Jeroen Ruigrok van der Werven
My apologies for hammering on this, but I think it is quite important and currently Python 3.0 seems confused about UCS-2 versus UTF-16. -On [20080702 20:47], Guido van Rossum ([EMAIL PROTECTED]) wrote: >No, Python already is aware of surrogates. I meant applications >processing non-BMP text shoul

Re: [Python-Dev] UCS2/UCS4 default

2008-07-02 Thread Guido van Rossum
On Wed, Jul 2, 2008 at 11:35 AM, Jeroen Ruigrok van der Werven <[EMAIL PROTECTED]> wrote: > -On [20080702 20:27], Guido van Rossum ([EMAIL PROTECTED]) wrote: >>I disagree. Instead, I would say that such code needs to be aware of >>surrogates. > > Just to make sure I understood you: > > Python's cod

Re: [Python-Dev] UCS2/UCS4 default

2008-07-02 Thread Jeroen Ruigrok van der Werven
-On [20080702 20:27], Guido van Rossum ([EMAIL PROTECTED]) wrote: >I disagree. Instead, I would say that such code needs to be aware of >surrogates. Just to make sure I understood you: Python's code needs to be made aware of surrogates? If so, do you want me to log issues for the things encounte

Re: [Python-Dev] UCS2/UCS4 default

2008-07-02 Thread Guido van Rossum
On Wed, Jul 2, 2008 at 11:22 AM, Jeroen Ruigrok van der Werven <[EMAIL PROTECTED]> wrote: > -On [20080702 19:42], Guido van Rossum ([EMAIL PROTECTED]) wrote: >>Yes. At least in the sense that \U gets translated to a >>surrogate pair, and that the UTF-8 codec supports surrogate pairs in >>bo

Re: [Python-Dev] UCS2/UCS4 default

2008-07-02 Thread Jeroen Ruigrok van der Werven
-On [20080702 19:42], Guido van Rossum ([EMAIL PROTECTED]) wrote: >Yes. At least in the sense that \U gets translated to a >surrogate pair, and that the UTF-8 codec supports surrogate pairs in >both directions. It's been like this for a long time. What else would >you expect from UTF-16 sup

Re: [Python-Dev] UCS2/UCS4 default

2008-07-02 Thread Guido van Rossum
On Wed, Jul 2, 2008 at 10:19 AM, Jeroen Ruigrok van der Werven <[EMAIL PROTECTED]> wrote: > -On [20080702 19:08], Guido van Rossum ([EMAIL PROTECTED]) wrote: >>I think we should continue to leave this up to the distribution. AFAIK >>many Linux distros already use UCS4 for everything anyway. > > Fre

Re: [Python-Dev] UCS2/UCS4 default

2008-07-02 Thread Jeroen Ruigrok van der Werven
-On [20080702 19:08], Guido van Rossum ([EMAIL PROTECTED]) wrote: >I think we should continue to leave this up to the distribution. AFAIK >many Linux distros already use UCS4 for everything anyway. FreeBSD's ports makes it a configure option. >For that reason I think it's also better that the con

Re: [Python-Dev] UCS2/UCS4 default

2008-07-02 Thread Guido van Rossum
I think we should continue to leave this up to the distribution. AFAIK many Linux distros already use UCS4 for everything anyway. The alternative (no matter what the configure flag is called) is UTF-16, not UCS-2 though: there is support for surrogate pairs in various places, including the \U esca
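For reference, the surrogate-pair arithmetic that UTF-16 defines for code points above U+FFFF can be sketched as follows. The function name `to_surrogate_pair` is hypothetical and is not taken from CPython; the formula is the standard UTF-16 one.

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x12345)
print(hex(high), hex(low))
```

The result can be cross-checked against `"\U00012345".encode("utf-16-be")`, which produces the same two 16-bit units in big-endian byte order.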

[Python-Dev] UCS2/UCS4 default

2008-07-02 Thread Jeroen Ruigrok van der Werven
Guido (and others of course), back in 2001 you pointed out that you wanted to move to UCS4 completely as the ideal situation (http://mail.python.org/pipermail/i18n-sig/2001-June/001107.html) over the current default UCS2. Given 3.0 will use Unicode strings as the default, would it also not make s