> - chr(i) returns a len-1 or len-2 string for all i in range(0, 0x110000) and
> ord(chr(i)) == i for all i in range(0, 0x110000)
This would contradict an explicit decision in PEP 261. I don't quite
remember the rationale for that; however, the PEP mentions that ord()
should be symmetric with unichr().
Jim Jewett writes:
> I suspect there may be others that are guaranteed never to get an
> assignment, because of their location. (Example: The character would
> have to have certain properties or be part of a specific script, but
> adding more such characters would violate some other stability policy.)
On 6/14/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> > There are also plenty of things that a native speaker may view as a
> > single character, but which unicode treats as (at most) a Named
> > Sequence.
> Eg, the New Line Function (Unicode's name for "universal newline"),
> which can…
On 6/14/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > There are also some that are explicitly not characters.
> > (U+FD00..U+FDEF)
> ??? U+FD00 is ARABIC LIGATURE HAH WITH YEH ISOLATED FORM,
> U+FDEF is unassigned.
Sorry; typo on my part. The start of the range is U+FDD0, not U+FD00.
I suspect…
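An illustrative check (not from the original mail): unicodedata reports
all 32 noncharacters in the corrected range as 'Cn', unassigned.

    import unicodedata

    # U+FDD0..U+FDEF: the 32 contiguous noncharacters. The database
    # classes them as 'Cn' because they are guaranteed never to be
    # assigned.
    cats = {unicodedata.category(chr(i)) for i in range(0xFDD0, 0xFDF0)}
    print(cats)  # {'Cn'}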
On 6/13/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> except that people will sneak in some UTF-16 behavior where it seems useful.
How about sneaking these in py3k-struni:
- chr(i) returns a len-1 or len-2 string for all i in range(0, 0x110000) and
ord(chr(i)) == i for all i in range(0, 0x110000)…
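A minimal sketch of what that proposal would mean on a narrow (UTF-16)
build; wide_chr and wide_ord are hypothetical names, not anything from
the py3k-struni branch:

    # Hypothetical helpers: chr() would return a surrogate pair for
    # non-BMP code points, and ord() would accept one back.
    def wide_chr(i):
        if not 0 <= i <= 0x10FFFF:
            raise ValueError("code point out of range")
        if i <= 0xFFFF:
            return chr(i)                      # the len-1 case
        i -= 0x10000                           # split into a surrogate pair
        return chr(0xD800 + (i >> 10)) + chr(0xDC00 + (i & 0x3FF))

    def wide_ord(s):
        if (len(s) == 2 and '\ud800' <= s[0] <= '\udbff'
                        and '\udc00' <= s[1] <= '\udfff'):
            return (0x10000 + ((ord(s[0]) - 0xD800) << 10)
                            + (ord(s[1]) - 0xDC00))
        return ord(s)                          # the ordinary len-1 case

    # The proposed invariant:
    assert all(wide_ord(wide_chr(i)) == i for i in range(0x110000))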
On 6/14/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > A code point is something that has a 1:1 relationship with a logical
> > character (in particular, a Unicode character).
As the word "character" is ambiguous, I'd put it this way:
Jim Jewett writes:
> > Apart from the surrogates, are there code points that aren't
> > characters?
> Yes. The BOM mark, for one.
Nitpick: The BOM *is* a character (FEFF, aka ZERO-WIDTH NO-BREAK
SPACE). Its byte-swapped counterpart FFFE is guaranteed *not* to be a
character. (Martin wrote…
> Yes. The BOM mark, for one.
Actually, the BOM *is* a character: ZERO WIDTH NO-BREAK SPACE,
character class Cf. This function of the code point (as a character)
is deprecated, though.
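The classifications are easy to verify (an illustration, not part of
the exchange):

    import unicodedata

    print(unicodedata.name('\ufeff'))      # ZERO WIDTH NO-BREAK SPACE
    print(unicodedata.category('\ufeff'))  # Cf: a format character
    print(unicodedata.category('\ufffe'))  # Cn: guaranteed not a character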
> There are also some that are explicitly not characters.
> (U+FD00..U+FDEF)
??? U+FD00 is ARABIC LIGATURE HAH WITH YEH ISOLATED FORM,
U+FDEF is unassigned.
On 6/13/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > A code point is something that has a 1:1 relationship with a logical
> > character (in particular, a Unicode character).
and
> > A code unit is the atomic base in some encoding.
> Thanks for clearing that up. It sounds like we really use code units,
> not code points (except when building with the 4-byte Unicode option,
> when they are equivalent). Is there anywhere where we use code points,
> apart from the UTF-8 codecs, which encode properly matched surrogate
> pairs as a single code point…
On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> >> Until one or more of the senior developers says otherwise, I'm going
> >> to assume that.
> >
> > Yeah, what's the difference between code units and points?
>
> A code unit is the atomic base in some encoding. It is a single byte
> in most encodings…
>> Until one or more of the senior developers says otherwise, I'm going
>> to assume that.
>
> Yeah, what's the difference between code units and points?
A code unit is the atomic base in some encoding. It is a single byte
in most encodings, but a 16-bit quantity in UTF-16 (and a 32-bit
quantity in UTF-32).
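An illustration of the distinction (not from the original mail): one
non-BMP code point counted in the code units of three encodings.

    # U+1D11E MUSICAL SYMBOL G CLEF is a single code point, but a
    # varying number of code units depending on the encoding.
    s = '\U0001D11E'
    print(len(s.encode('utf-8')))           # 4 eight-bit code units
    print(len(s.encode('utf-16-be')) // 2)  # 2 16-bit code units (a surrogate pair)
    print(len(s.encode('utf-32-be')) // 4)  # 1 32-bit code unit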
> I think we've reached a dead end. AIUI, that's a matter for a PEP,
> and the window for Python 3 is closed. I'm pretty sure that Python 3
> is going to have sequences of code units only (I know, Guido said
> "code points", but I doubt he's read TR#17), except that people will
> sneak in some UTF-16 behavior where it seems useful.
On 6/13/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> Rauli Ruohonen writes:
>
> > What I meant is that the stdlib should only have string operations
> > that effectively work on (1) sequences of code units or (2)
> > sequences of code points, and that the choice between these two
> > should be made reasonably.
Rauli Ruohonen writes:
> What I meant is that the stdlib should only have string operations
> that effectively work on (1) sequences of code units or (2)
> sequences of code points, and that the choice between these two
> should be made reasonably.
I think we've reached a dead end. AIUI, that's a matter for a PEP,
and the window for Python 3 is closed. I'm pretty sure that Python 3
is going to have sequences of code units only (I know, Guido said
"code points", but I doubt he's read TR#17), except that people will
sneak in some UTF-16 behavior where it seems useful.
On 6/13/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> What you are saying is that if you write a 10-line script that claims
> Unicode conformance, you are responsible for the Unicode-correctness of
> all modules you call implicitly as well as that of the Python interpreter.
If text files are…
Rauli Ruohonen writes:
> In my mind everything in a Python program is within a single
> Unicode process,
Which is a *serious* mistake. It is *precisely* the mistake that
leads to mixing UTF-16 and UCS-2 interpretations in the standard
library. What you are saying is that if you write a 10-line script
that claims Unicode conformance, you are responsible for the
Unicode-correctness of all modules you call implicitly as well as that
of the Python interpreter.
On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> On 6/12/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> > On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> > > Practically speaking, there's little need to interpret
> > > surrogate pairs as two code points instead of as one
> > > non-BMP code point.
On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> Another example would be unichr(), which gives you TypeError if you
> pass it a surrogate pair (oddly enough, as strings of different length
> are of the same type).
Sorry, I meant ord(), not unichr(). Anyway, ord(unichr(i)) == i doesn't
work for…
On 6/12/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> > Practically speaking, there's little need to interpret surrogate pairs
> > as two code points instead of as one non-BMP code point.
>
> Depends on your definition of "practically".
>
> Python does interpret them that way to maintain O(1) positional…
On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> Practically
> speaking, there's little need to interpret surrogate pairs as two
> code points instead of as one non-BMP code point.
Depends on your definition of "practically".
Python does interpret them that way to maintain O(1) positional…
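A sketch of that trade-off (illustrative, not from the mail): the two
16-bit code units a narrow build stored, and indexed, for one non-BMP
code point.

    # The UTF-16 code units that a narrow build exposed via s[0], s[1].
    s = '\U00010400'  # DESERET CAPITAL LETTER LONG I
    units = s.encode('utf-16-be')
    pair = [hex(int.from_bytes(units[i:i + 2], 'big')) for i in (0, 2)]
    print(pair)  # ['0xd801', '0xdc00']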
On 6/10/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> I think you misunderstand. Anything in Unicode that is normative is
> about interchange. Strings are also a means of interchange---between
> modules (separate Unicode processes) in a program (single OS process).
Like Martin said, "what…
Rauli Ruohonen writes:
> On 6/9/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> > Rauli Ruohonen writes:
> > > The ones it absolutely prohibits in interchange are surrogates.
> >
> > Excuse me? Surrogates are code points with a specific interpretation
> > if it is "purported that the stream is in UTF-16". Otherwise, Unicode
> > 4.0 explicitly says that there is nothing illegal about an isolated
> > surrogate…
On 6/9/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> Rauli Ruohonen writes:
> > The ones it absolutely prohibits in interchange are surrogates.
>
> Excuse me? Surrogates are code points with a specific interpretation
> if it is "purported that the stream is in UTF-16". Otherwise, Unicode
> 4.0 explicitly says that there is nothing illegal about an isolated
> surrogate…
Rauli Ruohonen writes:
> The ones it absolutely prohibits in interchange are surrogates.
Excuse me? Surrogates are code points with a specific interpretation
if it is "purported that the stream is in UTF-16". Otherwise, Unicode
4.0 explicitly says that there is nothing illegal about an isolated
surrogate…
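Python 3 ended up reflecting exactly this split (an illustration, not
from the mail): a lone surrogate is a legal str element, but no UTF
encoder will emit it.

    s = '\ud800'            # an isolated surrogate is representable in a str
    try:
        s.encode('utf-8')   # but it is ill-formed in every UTF encoding
    except UnicodeEncodeError as e:
        print(e.reason)     # surrogates not allowed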
> The additional field is 8 bits, two bits for each normalization (a
> Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are
> used, but I don't know if that's true of later versions. As
> _PyUnicode_Database_Records stores only unique records, this also results
> in an increase of…
On 6/8/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> The additional field is 8 bits, two bits for each normalization (a
> Yes/Maybe/No value). In Unicode 4.1 only 5 different combinations are
> used, but I don't know if that's true of later versions.
There are no "Maybe" values for the Decomposed forms (NFD and NFKD)…
On 6/8/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> AFAIK, the only strings the Unicode standard absolutely prohibits
> emitting are those containing code points guaranteed not to be
> characters by the standard.
The ones it absolutely prohibits in interchange are surrogates. They
are also…
On 6/8/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> In principle, yes. What's the cost of the additional field in terms of
> a size increase? If you just need another bit, could that fit into
> _PyUnicode_TypeRecord.flags instead?
> The additional field is 8 bits, two bits for each normalization (a
> Yes/Maybe/No value)…
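A hypothetical sketch of such a packing, two bits per form; the actual
layout generated by makeunicodedata.py may well differ:

    # Quick-check values packed two bits per normalization form.
    NO, MAYBE, YES = 0, 1, 2
    FORMS = ('NFC', 'NFKC', 'NFD', 'NFKD')

    def pack_quickcheck(values):
        """values: a dict mapping each form to NO, MAYBE or YES."""
        byte = 0
        for shift, form in enumerate(FORMS):
            byte |= values[form] << (2 * shift)
        return byte

    def unpack_quickcheck(byte, form):
        return (byte >> (2 * FORMS.index(form))) & 0b11

    packed = pack_quickcheck({'NFC': MAYBE, 'NFKC': MAYBE,
                              'NFD': NO, 'NFKD': NO})
    assert unpack_quickcheck(packed, 'NFC') == MAYBE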
Guido van Rossum writes:
> If you want to have an abstraction that guarantees you'll never see
> an unnormalized text string you should design a library for doing so.
OK.
> (*) It looks like such a library will not have a way to talk about
> "\u0308" at all, since it is considered unnormaliz
> I implemented it for all normalizations in the most straightforward way I
> could think of, which was adding a field to _PyUnicode_DatabaseRecord,
> generating data for it in makeunicodedata.py from
> DerivedNormalizationProps.txt of UCD 4.1, and writing a function
> is_normalized which uses it.
Rauli Ruohonen writes:
Stephen wrote:
> > I think the default case should be that text operations produce the
> > expected result in the text domain, even at the expense of array
> > invariants.
>
> If you really want that, then you need a type for sequences of graphemes.
No. "Text" != "
On 6/6/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > FWIW, I don't buy that normalization is expensive, as most strings are
> > in NFC form anyway, and there are fast checks for that (see UAX#15,
> > "Detecting Normalization Forms"). Python does not currently have
> > a fast path for this, but if it's added, then normalizing everything
> > to NFC should be…
On 6/8/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> How would you expect them to work on arrays of code points?
Just like they do with Python 2.5 unicode objects, as long as the
"array of code points" is str, not e.g. a numpy array or tuple of ints,
which I don't expect to grow string methods :-)
On 6/7/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> ... I will use XML character references to denote code points here.
> Wherever you see such a thing in this e-mail, replace it in your
> mind with the corresponding code point *immediately*. E.g.
> len(r'&#xc5;') == 1, but len(r'\u00c5') == 6.
"Stephen J. Turnbull" <[EMAIL PROTECTED]> wrote:
> Josiah Carlson writes:
>
> > Maybe I'm missing something, but it seems to me that there might be a
> > simple solution. Don't normalize any identifiers or strings.
>
> That's not a solution, that's denying that there's a problem.
For core Python…
On 6/7/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> What bothers me about the "sequence of code points" way of thinking is
> that len("Löwis") is nondeterministic.
It doesn't have to be, *for this specific example*. After what I've
read so far, I'm okay with normalization happening on the way in…
On 6/7/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> I apologize for mistyping the example. *I* *was* talking about a
> string literal containing Unicode characters.
Then I misunderstood you too. To avoid such problems, I will use XML
character references to denote code points here. Wherever you see such
a thing in this e-mail, replace it in your mind with the corresponding
code point *immediately*. E.g. len(r'&#xc5;') == 1, but
len(r'\u00c5') == 6.
> Then you wouldn't even be able to iterate over or index strings anymore,
> as that could produce such "invalid" strings, which would need to
> generate exceptions if you really want to ban them.
I don't think that's right: iterating over the string should
presumably generate an iteration of v…
Guido van Rossum writes:
> No it cannot. We are talking about \u escapes, not about a string
> literal containing Unicode characters ("Löwis").
Ah, good point.
I apologize for mistyping the example. *I* *was* talking about a
string literal containing Unicode characters. However, on my
terminal…
Josiah Carlson writes:
> Maybe I'm missing something, but it seems to me that there might be a
> simple solution. Don't normalize any identifiers or strings.
That's not a solution, that's denying that there's a problem.
> Hear me out for a moment. People type what they want.
You're thinking…
--- Guido van Rossum <[EMAIL PROTECTED]> wrote:
> > http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (Conformance)
> >
> > C9 A process shall not assume that the interpretations of two
> > canonical-equivalent character sequences are distinct.
>
> That is surely contained inside all sorts…
On 6/7/07, Bill Janssen <[EMAIL PROTECTED]> wrote:
> I meant to say that *strings* are explicitly sequences of characters,
> not codepoints.
This is false. When you access the contents of a string using the
*sequence* protocol, what you get is code points, not characters
(grapheme clusters). To get…
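An illustration of that protocol-level view (not from the mail):

    import unicodedata

    s = 'Lo\u0308wis'                   # reads as five characters: L ö w i s
    print(len(s))                       # 6: the sequence protocol counts code points
    print(unicodedata.combining(s[2]))  # 230: s[2] is a combining mark,
                                        # not a character on its own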
I wrote:
> Guido wrote:
> > So let me explain it. I see two different sequences of code points:
> > 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> > 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> > claim they are equivalent. They are two different sequences of code
> > points.
> But…
> if someone didn't want normalization, and Python did it anyways, then
> there would be an error that passed silently.
Then they'd read it as bytes, and do the processing themselves
explicitly (actually, what I do).
> It's the unicode character versus code point issue. I personally prefer…
> So let me explain it. I see two different sequences of code points:
> 'L', '\u00F6', 'w', 'i', 's' on the one hand, and 'L', 'o', '\u0308',
> 'w', 'i', 's' on the other. Never mind that Unicode has semantics that
> claim they are equivalent. They are two different sequences of code
> points.
If…
On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 6/6/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> > On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> >
> > > > about normalization of data strings. The big issue is string literals.
> > > > I think I agree with Stephen here:
> > >
On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 6/6/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> > On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> > > Why should the lexer apply normalization to literals behind my back?
> > The lexer shouldn't, but NFC normalizing the source before the lexer
> > sees it would be slightly more robust and standards-compliant…
On 6/6/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
>
> > > about normalization of data strings. The big issue is string literals.
> > > I think I agree with Stephen here:
>
> > > u"L\u00F6wis" == u"Lo\u0308wis"
>
> > > should be True (assuming he typed it correctly in the first
> > > place :-), because…
On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> > about normalization of data strings. The big issue is string literals.
> > I think I agree with Stephen here:
> > u"L\u00F6wis" == u"Lo\u0308wis"
> > should be True (assuming he typed it correctly in the first place :-),
> > because…
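For reference, this is how the example behaves when == stays a plain
code point comparison and normalization is explicit (an illustration,
not from the mail):

    import unicodedata

    s1 = 'L\u00F6wis'   # precomposed: LATIN SMALL LETTER O WITH DIAERESIS
    s2 = 'Lo\u0308wis'  # decomposed: 'o' followed by COMBINING DIAERESIS
    print(s1 == s2)     # False: == compares code point sequences
    print(unicodedata.normalize('NFC', s1) ==
          unicodedata.normalize('NFC', s2))   # True once both are normalized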
On 6/6/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> Rauli Ruohonen writes:
> > FWIW, I don't buy that normalization is expensive, as most strings are
> > in NFC form anyway, and there are fast checks for that (see UAX#15,
> > "Detecting Normalization Forms"). Python does not currently h
Bill Janssen <[EMAIL PROTECTED]> wrote:
>
> > Hear me out for a moment. People type what they want.
>
> I do a lot of Pythonic processing of UTF-8, which is not "typed by
> people", but instead extracted from documents by automated processing.
> Text is also data -- an important thing to keep in mind.
On 6/6/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> > Why should the lexer apply normalization to literals behind my back?
>
> The lexer shouldn't, but NFC normalizing the source before the lexer
> sees it would be slightly more robust and standards-compliant. This
> is because technically an editor or…
> > But I'm not about to change the == operator to apply normalization
> > first. It would affect too much (e.g. hashing).
>
> Yah, that's one reason why Jim Jewett and I lean to normalizing on the
> way in for explicitly Unicode data. But since that's not going to
> happen, I guess the thing is…
Guido van Rossum writes:
> But I'm not about to change the == operator to apply normalization
> first. It would affect too much (e.g. hashing).
Yah, that's one reason why Jim Jewett and I lean to normalizing on the
way in for explicitly Unicode data. But since that's not going to
happen, I guess the thing is…
Guido van Rossum schrieb:
> Clearly we will have a normalization routine so the
> lexer can normalize identifiers, so if you need normalized data it is
> as simple as writing 'XXX'.normalize() (or whatever the spelling
> should be).
It's actually in Python already, and spelled as
unicodedata.normalize().
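For example (an illustration, not from the mail; PEP 3131 specifies
NFKC for identifiers):

    import unicodedata

    # ANGSTROM SIGN folds to LATIN CAPITAL LETTER A WITH RING ABOVE
    print(unicodedata.normalize('NFKC', '\u212b') == '\u00c5')  # True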
> FWIW, I don't buy that normalization is expensive, as most strings are
> in NFC form anyway, and there are fast checks for that (see UAX#15,
> "Detecting Normalization Forms"). Python does not currently have
> a fast path for this, but if it's added, then normalizing everything
> to NFC should be…
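CPython did eventually grow this fast path: unicodedata.is_normalized()
(Python 3.8+) consults the quick-check properties and avoids building a
normalized copy in the common already-NFC case. A sketch of the pattern:

    import unicodedata

    def to_nfc(s):
        # Cheap when s is already NFC, which most strings are.
        if unicodedata.is_normalized('NFC', s):
            return s
        return unicodedata.normalize('NFC', s)

    print(to_nfc('L\u00F6wis'))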
On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> Why should the lexer apply normalization to literals behind my back?
The lexer shouldn't, but NFC normalizing the source before the lexer
sees it would be slightly more robust and standards-compliant. This is
because technically an editor or…
On 6/6/07, Bill Janssen <[EMAIL PROTECTED]> wrote:
> > Hear me out for a moment. People type what they want.
>
> I do a lot of Pythonic processing of UTF-8, which is not "typed by
> people", but instead extracted from documents by automated processing.
> Text is also data -- an important thing to keep in mind.
> Hear me out for a moment. People type what they want.
I do a lot of Pythonic processing of UTF-8, which is not "typed by
people", but instead extracted from documents by automated processing.
Text is also data -- an important thing to keep in mind.
As far as normalization goes, I agree with you…
On 6/6/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> Rauli Ruohonen writes:
>
> > Strings are internal to Python. This is a whole separate issue from
> > normalization of source code or its parts (such as identifiers).
>
> Agreed. But please note that we're not talking about representation.
"Stephen J. Turnbull" <[EMAIL PROTECTED]> wrote:
> Rauli Ruohonen writes:
>
> > Strings are internal to Python. This is a whole separate issue from
> > normalization of source code or its parts (such as identifiers).
>
> Agreed. But please note that we're not talking about representation.
> We're talking about the result of evaluating a comparison…
Rauli Ruohonen writes:
> Strings are internal to Python. This is a whole separate issue from
> normalization of source code or its parts (such as identifiers).
Agreed. But please note that we're not talking about representation.
We're talking about the result of evaluating a comparison:
i…
(Martin's right, it's not good to discuss this in the huge PEP 3131
thread, so I'm changing the subject line)
On 6/6/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> In the language of these standards, I would expect that string
> comparison is exactly the kind of higher-level process they have in
> mind…