Re: [Python-3000] String comparison

2007-06-14 Thread Martin v. Löwis
> - chr(i) returns a len-1 or len-2 string for all i in range(0, 0x110000) and > ord(chr(i)) == i for all i in range(0, 0x110000) This would contradict an explicit decision in PEP 261. I don't quite remember the rationale for that; however, the PEP mentions that ord() should be symmetric with

Re: [Python-3000] String comparison

2007-06-14 Thread Stephen J. Turnbull
Jim Jewett writes: > I suspect there may be others that are guaranteed never to get an > assignment, because of their location. (Example: The character would > have to have certain properties or be part of a specific script, but > adding more such characters would violate some other stabilit

Re: [Python-3000] String comparison

2007-06-14 Thread Jim Jewett
On 6/14/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > > There are also plenty of things that a native speaker may view as a > > single character, but which unicode treats as (at most) a Named > > Sequence. > Eg, the New Line Function (Unicode's name for "universal newline"), > which can

Re: [Python-3000] String comparison

2007-06-14 Thread Jim Jewett
On 6/14/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > There are also some that are explicitly not characters. > > (U+FD00..U+FDEF) > ??? U+FD00 is ARABIC LIGATURE HAH WITH YEH ISOLATED FORM, > U+FDEF is unassigned. Sorry; typo on my part. The start of the range is U+FDD0, not U+FD00. I suspe

Re: [Python-3000] String comparison

2007-06-14 Thread Rauli Ruohonen
On 6/13/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > except that people will sneak in some UTF-16 behavior where it seems useful. How about sneaking these in py3k-struni: - chr(i) returns a len-1 or len-2 string for all i in range(0, 0x110000) and ord(chr(i)) == i for all i in range(0,
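
The round-trip property proposed here can be checked with a small sketch. It assumes a current CPython 3, where chr() always returns a length-1 string and the "len-2" case of a 2007 narrow build shows up only in the UTF-16 encoding; the helper name utf16_units is made up for illustration.

```python
def utf16_units(i):
    """Number of 16-bit code units UTF-16 needs for code point i (hypothetical helper)."""
    return len(chr(i).encode("utf-16-le")) // 2

# Sample code points: ASCII, last BMP code point, first and last non-BMP code points.
samples = (0x41, 0xFFFF, 0x10000, 0x10FFFF)

# ord() undoes chr() for every sample.
round_trips = all(ord(chr(i)) == i for i in samples)

# One code unit inside the BMP, a surrogate pair (two units) above it.
units = [utf16_units(i) for i in samples]

# The code space ends at U+10FFFF; chr() rejects anything beyond it.
try:
    chr(0x110000)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False

print(round_trips, units, beyond_range_ok)
```

On a build that stores UTF-16 code units, the second list is what len(chr(i)) would have reported, which is the behaviour the proposal describes.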

Re: [Python-3000] String comparison

2007-06-14 Thread Rauli Ruohonen
On 6/14/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > A code point is something that has a 1:1 relationship with a logical > > character (in particular, a Unicode character). As the word "character" is ambiguous, I'd put it this way:

Re: [Python-3000] String comparison

2007-06-14 Thread Stephen J. Turnbull
Jim Jewett writes: > > Apart from the surrogates, are there code points that aren't > > characters? > Yes. The BOM mark, for one. Nitpick: The BOM *is* a character (FEFF, aka ZERO-WIDTH NO-BREAK SPACE). Its byte-swapped counterpart FFFE is guaranteed *not* to be a character. (Martin wrote

Re: [Python-3000] String comparison

2007-06-13 Thread Martin v. Löwis
> Yes. The BOM mark, for one. Actually, the BOM *is* a character: ZERO WIDTH NO-BREAK SPACE, character class Cf. This function of the code point (as a character) is deprecated, though. > There are also some that are explicitly not characters. > (U+FD00..U+FDEF) ??? U+FD00 is ARABIC LIGATURE HAH
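
The claims about these code points can be looked up in the unicodedata module. This is only a sketch: the answers come from whatever Unicode database ships with the running interpreter, not from the Unicode 4.x data under discussion in the thread, and the helper describe is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (general category, name or placeholder) for one code point."""
    return unicodedata.category(ch), unicodedata.name(ch, "<no name>")

bom = describe("\ufeff")        # the BOM code point, as a character
swapped = describe("\ufffe")    # its byte-swapped counterpart, a noncharacter
range_start = describe("\ufdd0")  # start of the noncharacter range U+FDD0..U+FDEF
ligature = describe("\ufd00")   # the code point from the mistyped range

for label, info in (("U+FEFF", bom), ("U+FFFE", swapped),
                    ("U+FDD0", range_start), ("U+FD00", ligature)):
    print(label, info)
```

Noncharacters are reported with category Cn and have no name, which is why the helper passes a default to unicodedata.name().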

Re: [Python-3000] String comparison

2007-06-13 Thread Jim Jewett
On 6/13/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > A code point is something that has a 1:1 relationship with a logical > > character (in particular, a Unicode character). and > > A code unit is the atomic base in some encoding.

Re: [Python-3000] String comparison

2007-06-13 Thread Martin v. Löwis
> Thanks for clearing that up. It sounds like we really use code units, > not code points (except when building with the 4-byte Unicode option, > when they are equivalent). Is there anywhere where we use code points, > apart from the UTF-8 codecs, which encode properly matched surrogate > pairs as a

Re: [Python-3000] String comparison

2007-06-13 Thread Guido van Rossum
On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > >> Until one or more of the senior developers says otherwise, I'm going > >> to assume that. > > > > Yeah, what's the difference between code units and points? > > A code unit is the atomic base in some encoding. It is a single byte > in mo

Re: [Python-3000] String comparison

2007-06-13 Thread Martin v. Löwis
>> Until one or more of the senior developers says otherwise, I'm going >> to assume that. > > Yeah, what's the difference between code units and points? A code unit is the atomic base in some encoding. It is a single byte in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit quantity
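
The distinction between code units and code points can be illustrated by counting both for one non-BMP character. A minimal sketch, assuming a current CPython 3 (where len() of a str counts code points) and using the codecs only to count units; the variable names are illustrative.

```python
# One code point outside the BMP: U+1D11E MUSICAL SYMBOL G CLEF.
s = "\U0001D11E"

code_points = len(s)                            # one item per code point
utf8_units = len(s.encode("utf-8"))             # 8-bit code units
utf16_units = len(s.encode("utf-16-le")) // 2   # 16-bit code units (a surrogate pair)
utf32_units = len(s.encode("utf-32-le")) // 4   # 32-bit code units

print(code_points, utf8_units, utf16_units, utf32_units)
```

The same character is one code point throughout, but four, two, or one code units depending on the encoding form.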

Re: [Python-3000] String comparison

2007-06-13 Thread Martin v. Löwis
> I think we've reached a dead end. AIUI, that's a matter for a PEP, > and the window for Python 3 is closed. I'm pretty sure that Python 3 > is going to have sequences of code units only (I know, Guido said > "code points", but I doubt he's read TR#17), except that people will > sneak in some UT

Re: [Python-3000] String comparison

2007-06-13 Thread Guido van Rossum
On 6/13/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > Rauli Ruohonen writes: > > > What I meant is that the stdlib should only have string operations > > that effectively work on (1) sequences of code units or (2) > > sequences of code points, and that the choice between these two > > sh

Re: [Python-3000] String comparison

2007-06-13 Thread Stephen J. Turnbull
Rauli Ruohonen writes: > What I meant is that the stdlib should only have string operations > that effectively work on (1) sequences of code units or (2) > sequences of code points, and that the choice between these two > should be made reasonably. I think we've reached a dead end. AIUI, tha

Re: [Python-3000] String comparison

2007-06-13 Thread Rauli Ruohonen
On 6/13/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > What you are saying is that if you write a 10-line script that claims > Unicode conformance, you are responsible for the Unicode-correctness of > all modules you call implicitly as well as that of the Python interpreter. If text files ar

Re: [Python-3000] String comparison

2007-06-12 Thread Stephen J. Turnbull
Rauli Ruohonen writes: > In my mind everything in a Python program is within a single > Unicode process, Which is a *serious* mistake. It is *precisely* the mistake that leads to mixing UTF-16 and UCS-2 interpretations in the standard library. What you are saying is that if you write a 10-lin

Re: [Python-3000] String comparison

2007-06-12 Thread Jim Jewett
On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote: > On 6/12/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > > On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote: > > > Practically speaking, there's little need to interpret > > > surrogate pairs as two code points instead of as one > > > non-BMP c

Re: [Python-3000] String comparison

2007-06-12 Thread Rauli Ruohonen
On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote: > Another example would be unichr(), which gives you TypeError if you > pass it a surrogate pair (oddly enough, as strings of different length > are of the same type). Sorry, I meant ord(), not unichr. Anyway, ord(unichr(i)) == i doesn't work f

Re: [Python-3000] String comparison

2007-06-12 Thread Rauli Ruohonen
On 6/12/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote: > > Practically speaking, there's little need to interpret surrogate pairs > > as two code points instead of as one non-BMP code point. > > Depends on your definition of "practically". > > Pyth

Re: [Python-3000] String comparison

2007-06-12 Thread Jim Jewett
On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote: > Practically > speaking, there's little need to interpret surrogate pairs as two > code points instead of as one non-BMP code point. Depends on your definition of "practically". Python does interpret them that way to maintain O(1) positional

Re: [Python-3000] String comparison

2007-06-12 Thread Rauli Ruohonen
On 6/10/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > I think you misunderstand. Anything in Unicode that is normative is > about interchange. Strings are also a means of interchange---between > modules (separate Unicode processes) in a program (single OS process). Like Martin said, "what

Re: [Python-3000] String comparison

2007-06-10 Thread Stephen J. Turnbull
Rauli Ruohonen writes: > On 6/9/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > > Rauli Ruohonen writes: > > > The ones it absolutely prohibits in interchange are surrogates. > > > > Excuse me? Surrogates are code points with a specific interpretation > > if it is "purported that the

Re: [Python-3000] String comparison

2007-06-09 Thread Rauli Ruohonen
On 6/9/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > Rauli Ruohonen writes: > > The ones it absolutely prohibits in interchange are surrogates. > > Excuse me? Surrogates are code points with a specific interpretation > if it is "purported that the stream is in UTF-16". Otherwise, Unicode

Re: [Python-3000] String comparison

2007-06-08 Thread Stephen J. Turnbull
Rauli Ruohonen writes: > The ones it absolutely prohibits in interchange are surrogates. Excuse me? Surrogates are code points with a specific interpretation if it is "purported that the stream is in UTF-16". Otherwise, Unicode 4.0 explicitly says that there is nothing illegal about an isolate

Re: [Python-3000] String comparison

2007-06-08 Thread Martin v. Löwis
> The additional field is 8 bits, two bits for each normalization (a > Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are > used, but I don't know if that's true of later versions. As > _PyUnicode_Database_Records stores only unique records, this also results > in an increase of

Re: [Python-3000] String comparison

2007-06-08 Thread Jim Jewett
On 6/8/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote: > The additional field is 8 bits, two bits for each normalization (a > Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are > used, but I don't know if that's true of later versions. There are no "Maybe" values for the Decompose

Re: [Python-3000] String comparison

2007-06-08 Thread Rauli Ruohonen
On 6/8/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > AFAIK, the only strings the Unicode standard absolutely prohibits > emitting are those containing code points guaranteed not to be > characters by the standard. The ones it absolutely prohibits in interchange are surrogates. They are also

Re: [Python-3000] String comparison

2007-06-08 Thread Rauli Ruohonen
On 6/8/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > In principle, yes. What's the cost of the additional field in terms of > a size increase? If you just need another bit, could that fit into > _PyUnicode_TypeRecord.flags instead? The additional field is 8 bits, two bits for each normalizati
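
The layout described here, two bits per normalization form packed into one 8-bit field, could look roughly like the sketch below. The slot order, the numeric values for Yes/No/Maybe, and every name in the code are assumptions made for illustration; the snippet does not say how the actual patch encodes them.

```python
from enum import IntEnum

class QuickCheck(IntEnum):
    # Hypothetical numeric values; the real patch may differ.
    YES = 0
    NO = 1
    MAYBE = 2

# Hypothetical slot order: slot n occupies bits 2n and 2n+1.
FORMS = ("NFD", "NFKD", "NFC", "NFKC")

def pack(values):
    """Pack a {form: QuickCheck} mapping into a single 8-bit integer."""
    packed = 0
    for slot, form in enumerate(FORMS):
        packed |= int(values[form]) << (2 * slot)
    return packed

def unpack(packed, form):
    """Read the two-bit quick-check value for one normalization form."""
    slot = FORMS.index(form)
    return QuickCheck((packed >> (2 * slot)) & 0b11)

# Example record: a code point that may need composing but is fine when decomposed.
record = pack({"NFD": QuickCheck.YES, "NFKD": QuickCheck.YES,
               "NFC": QuickCheck.MAYBE, "NFKC": QuickCheck.MAYBE})
print(record, unpack(record, "NFC"))
```

Four two-bit slots fill exactly one byte, which matches the 8-bit size quoted in the message.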

Re: [Python-3000] String comparison

2007-06-08 Thread Stephen J. Turnbull
Guido van Rossum writes: > If you want to have an abstraction that guarantees you'll never see > an unnormalized text string you should design a library for doing so. OK. > (*) It looks like such a library will not have a way to talk about > "\u0308" at all, since it is considered unnormaliz

Re: [Python-3000] String comparison

2007-06-07 Thread Martin v. Löwis
> I implemented it for all normalizations in the most straightforward way I > could think of, which was adding a field to _PyUnicode_DatabaseRecord, > generating data for it in makeunicodedata.py from > DerivedNormalizationProps.txt of UCD 4.1, and writing a function > is_normalized which uses it.

Re: [Python-3000] String comparison

2007-06-07 Thread Stephen J. Turnbull
Rauli Ruohonen writes: Stephen wrote: > > I think the default case should be that text operations produce the > > expected result in the text domain, even at the expense of array > > invariants. > > If you really want that, then you need a type for sequences of graphemes. No. "Text" != "

Re: [Python-3000] String comparison

2007-06-07 Thread Rauli Ruohonen
On 6/6/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > FWIW, I don't buy that normalization is expensive, as most strings are > > in NFC form anyway, and there are fast checks for that (see UAX#15, > > "Detecting Normalization Forms"). Python does not currently have > > a fast path for this, b

Re: [Python-3000] String comparison

2007-06-07 Thread Rauli Ruohonen
On 6/8/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > How would you expect them to work on arrays of code points? Just like they do with Python 2.5 unicode objects, as long as the "array of code points" is str, not e.g. a numpy array or tuple of ints, which I don't expect to grow string methods :-)

Re: [Python-3000] String comparison

2007-06-07 Thread Jim Jewett
On 6/7/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote: > ... I will use XML character references to denote code points here. > Wherever you see such a thing in this e-mail, replace it in your > mind with the corresponding code point *immediately*. E.g. > len(r'&#x00c5;') == 1, but len(r'\u00c5') == 6.

Re: [Python-3000] String comparison

2007-06-07 Thread Josiah Carlson
"Stephen J. Turnbull" <[EMAIL PROTECTED]> wrote: > Josiah Carlson writes: > > > Maybe I'm missing something, but it seems to me that there might be a > > simple solution. Don't normalize any identifiers or strings. > > That's not a solution, that's denying that there's a problem. For core Py

Re: [Python-3000] String comparison

2007-06-07 Thread Guido van Rossum
On 6/7/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > What bothers me about the "sequence of code points" way of thinking is > that len("Löwis") is nondeterministic. It doesn't have to be, *for this specific example*. After what I've read so far, I'm okay with normalization happening on the

Re: [Python-3000] String comparison

2007-06-07 Thread Rauli Ruohonen
On 6/7/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > I apologize for mistyping the example. *I* *was* talking about a > string literal containing Unicode characters. Then I misunderstood you too. To avoid such problems, I will use XML character references to denote code points here. Wherev

Re: [Python-3000] String comparison

2007-06-07 Thread Bill Janssen
> Then you wouldn't even be able to iterate over or index strings anymore, > as that could produce such "invalid" strings, which would need to > generate exceptions if you really want to ban them. I don't think that's right: iterating over the string should presumably generate an iteration of v

Re: [Python-3000] String comparison

2007-06-07 Thread Stephen J. Turnbull
Guido van Rossum writes: > No it cannot. We are talking about \u escapes, not about a string > literal containing Unicode characters ("Löwis"). Ah, good point. I apologize for mistyping the example. *I* *was* talking about a string literal containing Unicode characters. However, on my termin

Re: [Python-3000] String comparison

2007-06-07 Thread Stephen J. Turnbull
Josiah Carlson writes: > Maybe I'm missing something, but it seems to me that there might be a > simple solution. Don't normalize any identifiers or strings. That's not a solution, that's denying that there's a problem. > Hear me out for a moment. People type what they want. You're thinkin

Re: [Python-3000] String comparison

2007-06-06 Thread Steve Howell
--- Guido van Rossum <[EMAIL PROTECTED]> wrote: > http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf > (Conformance) > > > > C9 A process shall not assume that the > interpretations of two > > canonical-equivalent character sequences are > distinct. > > That is surely contained inside all sort

Re: [Python-3000] String comparison

2007-06-06 Thread Rauli Ruohonen
On 6/7/07, Bill Janssen <[EMAIL PROTECTED]> wrote: > I meant to say that *strings* are explicitly sequences of characters, > not codepoints. This is false. When you access the contents of a string using the *sequence* protocol, what you get is code points, not characters (grapheme clusters). To ge
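
The point that the sequence protocol yields code points rather than user-perceived characters can be seen by iterating over a decomposed string. The grouping function below is only a rough approximation that attaches combining marks to the preceding base character; it is not the grapheme cluster algorithm of UAX #29, and its name is made up for illustration.

```python
import unicodedata

def rough_clusters(text):
    """Group each combining mark with the code point before it (approximation only)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

decomposed = "Lo\u0308wis"          # 'o' followed by COMBINING DIAERESIS

items = list(decomposed)            # what the sequence protocol hands out
clusters = rough_clusters(decomposed)

print(len(items), len(clusters))
```

Indexing or iterating the string exposes the combining mark as an item of its own, while the grouped view has one entry per perceived letter.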

Re: [Python-3000] String comparison

2007-06-06 Thread Bill Janssen
I wrote: > Guido wrote: > > So let me explain it. I see two different sequences of code points: > > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308', > > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that > > claim they are equivalent. They are two different

Re: [Python-3000] String comparison

2007-06-06 Thread Bill Janssen
> But > if someone didn't want normalization, and Python did it anyways, then > there would be an error that passed silently. Then they'd read it as bytes, and do the processing themselves explicitly (actually, what I do). > It's the unicode character versus code point issue. I personally prefer

Re: [Python-3000] String comparison

2007-06-06 Thread Bill Janssen
> So let me explain it. I see two different sequences of code points: > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308', > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that > claim they are equivalent. They are two different sequences of code > points. If

Re: [Python-3000] String comparison

2007-06-06 Thread Jim Jewett
On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > On 6/6/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > > On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > > > > > > about normalization of data strings. The big issue is string literals. > > > > I think I agree with Stephen here: > > >

Re: [Python-3000] String comparison

2007-06-06 Thread Jim Jewett
On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > On 6/6/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote: > > On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > > > Why should the lexer apply normalization to literals behind my back? > > The lexer shouldn't, but NFC normalizing the sourc

Re: [Python-3000] String comparison

2007-06-06 Thread Guido van Rossum
On 6/6/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > > > > about normalization of data strings. The big issue is string literals. > > > I think I agree with Stephen here: > > > > u"L\u00F6wis" == u"Lo\u0308wis" > > > > should be True (assu

Re: [Python-3000] String comparison

2007-06-06 Thread Jim Jewett
On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > > about normalization of data strings. The big issue is string literals. > > I think I agree with Stephen here: > > u"L\u00F6wis" == u"Lo\u0308wis" > > should be True (assuming he typed it correctly in the first place :-), > > because

Re: [Python-3000] String comparison

2007-06-06 Thread Jim Jewett
On 6/6/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > Rauli Ruohonen writes: > > FWIW, I don't buy that normalization is expensive, as most strings are > > in NFC form anyway, and there are fast checks for that (see UAX#15, > > "Detecting Normalization Forms"). Python does not currently h

Re: [Python-3000] String comparison

2007-06-06 Thread Josiah Carlson
Bill Janssen <[EMAIL PROTECTED]> wrote: > > > Hear me out for a moment. People type what they want. > > I do a lot of Pythonic processing of UTF-8, which is not "typed by > people", but instead extracted from documents by automated processing. > Text is also data -- an important thing to keep i

Re: [Python-3000] String comparison

2007-06-06 Thread Guido van Rossum
On 6/6/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote: > On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > > Why should the lexer apply normalization to literals behind my back? > > The lexer shouldn't, but NFC normalizing the source before the lexer > sees it would be slightly more robust and

Re: [Python-3000] String comparison

2007-06-06 Thread Martin v. Löwis
> > But I'm not about to change the == operator to apply normalization > > first. It would affect too much (e.g. hashing). > > Yah, that's one reason why Jim Jewett and I lean to normalizing on the > way in for explicitly Unicode data. But since that's not going to > happen, I guess the thing i

Re: [Python-3000] String comparison

2007-06-06 Thread Stephen J. Turnbull
Guido van Rossum writes: > But I'm not about to change the == operator to apply normalization > first. It would affect too much (e.g. hashing). Yah, that's one reason why Jim Jewett and I lean to normalizing on the way in for explicitly Unicode data. But since that's not going to happen, I gue

Re: [Python-3000] String comparison

2007-06-06 Thread Martin v. Löwis
Guido van Rossum schrieb: > Clearly we will have a normalization routine so the > lexer can normalize identifiers, so if you need normalized data it is > as simple as writing 'XXX'.normalize() (or whatever the spelling > should be). It's actually in Python already, and spelled as unicodedata.norma
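
For the running example of the thread, the call looks like this. A minimal sketch using Python 3 string literals; the variable names are illustrative.

```python
import unicodedata

composed = "L\u00F6wis"       # precomposed LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "Lo\u0308wis"    # 'o' followed by COMBINING DIAERESIS

# == compares code point sequences, so the two spellings differ.
raw_equal = composed == decomposed

# After normalizing both sides to the same form they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, nfc_equal, nfd_equal, len(composed), len(decomposed))
```

Which form to normalize to is the caller's choice; the comparison only needs both operands in the same one.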

Re: [Python-3000] String comparison

2007-06-06 Thread Martin v. Löwis
> FWIW, I don't buy that normalization is expensive, as most strings are > in NFC form anyway, and there are fast checks for that (see UAX#15, > "Detecting Normalization Forms"). Python does not currently have > a fast path for this, but if it's added, then normalizing everything > to NFC should be
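
A check-then-normalize fast path of the kind discussed here can be sketched with unicodedata.is_normalized(), which was added to the standard library much later (Python 3.8) and did not exist when this thread was written. The wrapper name to_nfc is made up for illustration.

```python
import unicodedata

def to_nfc(text):
    """Return text in NFC, skipping the rewrite when it is already normalized."""
    if unicodedata.is_normalized("NFC", text):
        return text                      # common case: nothing to do
    return unicodedata.normalize("NFC", text)

already_nfc = "L\u00F6wis"
needs_work = "Lo\u0308wis"

print(to_nfc(already_nfc) == already_nfc, to_nfc(needs_work) == already_nfc)
```

Whether the explicit check saves time in practice depends on the interpreter version, since normalize() may perform a similar quick check internally; treat the wrapper as an illustration of the idea rather than a measured optimization.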

Re: [Python-3000] String comparison

2007-06-06 Thread Rauli Ruohonen
On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > Why should the lexer apply normalization to literals behind my back? The lexer shouldn't, but NFC normalizing the source before the lexer sees it would be slightly more robust and standards-compliant. This is because technically an editor or

Re: [Python-3000] String comparison

2007-06-06 Thread Guido van Rossum
On 6/6/07, Bill Janssen <[EMAIL PROTECTED]> wrote: > > Hear me out for a moment. People type what they want. > > I do a lot of Pythonic processing of UTF-8, which is not "typed by > people", but instead extracted from documents by automated processing. > Text is also data -- an important thing to

Re: [Python-3000] String comparison

2007-06-06 Thread Bill Janssen
> Hear me out for a moment. People type what they want. I do a lot of Pythonic processing of UTF-8, which is not "typed by people", but instead extracted from documents by automated processing. Text is also data -- an important thing to keep in mind. As far as normalization goes, I agree with yo

Re: [Python-3000] String comparison

2007-06-06 Thread Guido van Rossum
On 6/6/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > Rauli Ruohonen writes: > > > Strings are internal to Python. This is a whole separate issue from > > normalization of source code or its parts (such as identifiers). > > Agreed. But please note that we're not talking about representatio

Re: [Python-3000] String comparison

2007-06-06 Thread Josiah Carlson
"Stephen J. Turnbull" <[EMAIL PROTECTED]> wrote: > Rauli Ruohonen writes: > > > Strings are internal to Python. This is a whole separate issue from > > normalization of source code or its parts (such as identifiers). > > Agreed. But please note that we're not talking about representation. > W

[Python-3000] String comparison

2007-06-06 Thread Stephen J. Turnbull
Rauli Ruohonen writes: > Strings are internal to Python. This is a whole separate issue from > normalization of source code or its parts (such as identifiers). Agreed. But please note that we're not talking about representation. We're talking about the result of evaluating a comparison: i

[Python-3000] String comparison

2007-06-06 Thread Rauli Ruohonen
(Martin's right, it's not good to discuss this in the huge PEP 3131 thread, so I'm changing the subject line) On 6/6/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > In the language of these standards, I would expect that string > comparison is exactly the kind of higher-level process they hav