On 6/13/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> except that people will sneak in some UTF-16 behavior where it seems useful.
How about sneaking these in py3k-struni:
- chr(i) returns a len-1 or len-2 string for all i in range(0, 0x11) and
ord(chr(i)) == i for all i in range(0,
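A minimal sketch of what that len-1/len-2 behaviour would amount to on a UTF-16 representation; the helper names utf16_chr and utf16_ord are made up here for illustration, not part of the proposal:

    def utf16_chr(i):
        # BMP code points map to a single code unit; supplementary code
        # points map to a high/low surrogate pair (a length-2 string).
        if i < 0x10000:
            return chr(i)
        i -= 0x10000
        return chr(0xD800 + (i >> 10)) + chr(0xDC00 + (i & 0x3FF))

    def utf16_ord(s):
        # Inverse of utf16_chr: accepts a length-1 or length-2 string.
        if len(s) == 2:
            hi, lo = map(ord, s)
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
        return ord(s)

    assert all(utf16_ord(utf16_chr(i)) == i
               for i in (0x41, 0xFFFF, 0x10000, 0x10FFFF))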
On 6/14/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On 6/13/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > A code point is something that has a 1:1 relationship with a logical
> > character (in particular, a Unicode character).
As the word "character" is ambiguous, I'd put it this way:
On 6/13/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> What you are saying is that if you write a 10-line script that claims
> Unicode conformance, you are responsible for the Unicode-correctness of
> all modules you call implicitly as well as that of the Python interpreter.
If text files ar
On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> Another example would be unichr(), which gives you TypeError if you
> pass it a surrogate pair (oddly enough, as strings of different length
> are of the same type).
Sorry, I meant ord(), not unichr(). Anyway, ord(unichr(i)) ==
On 6/12/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 6/12/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> > Practically speaking, there's little need to interpret surrogate pairs
> > as two code points instead of as one non-BMP code point.
>
> Depe
On 6/10/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> I think you misunderstand. Anything in Unicode that is normative is
> about interchange. Strings are also a means of interchange---between
> modules (separate Unicode processes) in a program (single OS process).
Like Martin said, "what
On 6/12/07, Baptiste Carvello <[EMAIL PROTECTED]> wrote:
> This is where we strongly disagree. If an identifier is written in
> transliterated Chinese, I cannot understand what it means, but I can
> recognise it when it is used in the code. I will then find out the
> meaning from the context. By co
On 6/11/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> "In fact, it might even use something downright misleading, and
> you won't have any warning, because we thought that maybe someone,
> somewhere, might have wanted that character in a different context."
>
> And no, I don't think I'm exagerati
On 6/10/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > To truly enable Python in a non-English teaching
> > environment, I think you'd actually want to go a step
> > further and just internationalize the whole program.
>
> I don't know why that theory keeps popping up when people
> have repea
On 6/9/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> Rauli Ruohonen writes:
> > The ones it absolutely prohibits in interchange are surrogates.
>
> Excuse me? Surrogates are code points with a specific interpretation
> if it is "purported that the stream is in
On 6/8/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> AFAIK, the only strings the Unicode standard absolutely prohibits
> emitting are those containing code points guaranteed not to be
> characters by the standard.
The ones it absolutely prohibits in interchange are surrogates. They
are also
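For what it's worth, today's Python 3 codecs already treat lone surrogates as non-interchangeable; a quick illustration (current behaviour, not anything proposed in the thread):

    lone = '\ud800'              # an unpaired high surrogate
    try:
        lone.encode('utf-8')     # the codec refuses to emit it
    except UnicodeEncodeError as exc:
        print(exc)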
On 6/8/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> In principle, yes. What's the cost of the additional field in terms of
> a size increase? If you just need another bit, could that fit into
> _PyUnicode_TypeRecord.flags instead?
The additional field is 8 bits, two bits for each normalizati
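A rough sketch of the two-bits-per-form idea; the packing scheme and value encoding here are illustrative only, not CPython's actual layout:

    # One byte holds a 2-bit quick-check state for each of the four forms.
    # Illustrative encoding: 0 = unknown, 1 = yes, 2 = no, 3 = maybe.
    NFC, NFD, NFKC, NFKD = range(4)

    def pack_quickcheck(values):
        byte = 0
        for form, value in enumerate(values):
            byte |= (value & 0b11) << (2 * form)
        return byte

    def unpack_quickcheck(byte, form):
        return (byte >> (2 * form)) & 0b11

    flags = pack_quickcheck([1, 3, 2, 0])   # NFC=yes, NFD=maybe, NFKC=no, NFKD=unknown
    assert unpack_quickcheck(flags, NFD) == 3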
On 6/6/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > FWIW, I don't buy that normalization is expensive, as most strings are
> > in NFC form anyway, and there are fast checks for that (see UAX#15,
> > "Detecting Normalization Forms"). Python does not currently have
> > a fast path for this, b
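Python did eventually grow such a fast path: since 3.8, unicodedata.is_normalized() performs a UAX #15-style quick check. For example:

    import unicodedata

    s = 'cafe\u0301'                              # 'café' spelled in decomposed form
    print(unicodedata.is_normalized('NFC', s))    # False: this spelling is not NFC
    print(unicodedata.normalize('NFC', s))        # recomposed to 'café' with U+00E9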
On 6/8/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> How would you expect them to work on arrays of code points?
Just like they do with Python 2.5 unicode objects, as long as the
"array of code points" is str, not e.g. a numpy array or tuple of ints,
which I don't expect to grow string methods :-)
On 6/7/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> I apologize for mistyping the example. *I* *was* talking about a
> string literal containing Unicode characters.
Then I misunderstood you too. To avoid such problems, I will use XML
character references to denote code points here. Wherev
On 6/7/07, Bill Janssen <[EMAIL PROTECTED]> wrote:
> I meant to say that *strings* are explicitly sequences of characters,
> not codepoints.
This is false. When you access the contents of a string using the
*sequence* protocol, what you get is code points, not characters
(grapheme clusters). To ge
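The distinction is easy to see in plain Python 3 (nothing specific to this thread):

    s = 'e\u0301'      # LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT
    print(len(s))      # 2: two code points, though it renders as a single 'é'
    print(s[0], s[1])  # indexing hands back code points, not the grapheme cluster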
On 6/6/07, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> Why should the lexer apply normalization to literals behind my back?
The lexer shouldn't, but NFC normalizing the source before the lexer
sees it would be slightly more robust and standards-compliant. This is
because technically an editor or
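A minimal sketch of that idea; read_source is a hypothetical helper, not an existing hook in the tokenizer:

    import unicodedata

    def read_source(path, encoding='utf-8'):
        # Decode the file and NFC-normalize the whole text before the lexer
        # sees it, so differently composed spellings compare equal downstream.
        with open(path, encoding=encoding) as f:
            return unicodedata.normalize('NFC', f.read())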
(Martin's right, it's not good to discuss this in the huge PEP 3131
thread, so I'm changing the subject line)
On 6/6/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> In the language of these standards, I would expect that string
> comparison is exactly the kind of higher-level process they hav
On 6/6/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> No. The point is that people want to use their current tools; they
> may not be able to easily specify normalization.
> Please look through the list (I've already done so; I'm speaking from
> detailed examination of the data) and state w
On 6/5/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> I'd love to get rid of full-width ASCII and halfwidth kana (via
> compatibility decomposition).
If you do forbid compatibility characters in identifiers, then they
should be flagged as an error, not converted silently. NFC, on the
other h
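The difference is easy to demonstrate with unicodedata: NFC leaves compatibility characters alone, while NFKC (compatibility decomposition followed by composition) folds them:

    import unicodedata

    fullwidth_a = '\uff21'       # FULLWIDTH LATIN CAPITAL LETTER A
    halfwidth_ka = '\uff76'      # HALFWIDTH KATAKANA LETTER KA
    print(unicodedata.normalize('NFC', fullwidth_a))    # unchanged: 'Ａ'
    print(unicodedata.normalize('NFKC', fullwidth_a))   # folded to plain 'A'
    print(unicodedata.normalize('NFKC', halfwidth_ka))  # folded to 'カ' (U+30AB)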
On 6/5/07, Talin <[EMAIL PROTECTED]> wrote:
> Thanks so much for this excellent roundup from the RoundUp Master :)
> Seriously, I've been staying well away from the PEP 3131 threads, and I
> was hoping that someone would post a summary of the issues so I could
> catch up.
I agree that the roundup
On 6/4/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 6/4/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > However, what would that mean wrt. non-Unicode source encodings?
>
> > Say you have a Latin-1-encoded source code. Is that in NFC or not?
The path of least surprise for legacy encodings m
On 6/4/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> No, it can't. One might want to write Python code that implements
> normalization algorithms, for example, and there will be "binary
> strings". Only in the context of Unicode text are you allowed to
> do those things.
But Python files
On 6/3/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Sure - but how can Python tell whether a non-normalized string was
> intentionally put into the source, or as a side effect of the editor
> modifying it?
It can't, but does it really need to? It could always assume the latter.
> In most ca
On 6/3/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> Merely to define those is non-trivial, and it is absolutely out
> of the question to expect that the average Python user will know
> what the character set "strictly-conforms-to-UTR39-restrictions-
> allows-confusables" is.
This is a bit
(sorry about replying to such an old mail, but I didn't find a better place
to put this)
On 5/1/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> All identifiers are converted into the normal form NFC while parsing;
Actually, shouldn't the whole file be converted to NFC, instead of
only identifiers?
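(For reference, PEP 3131 as accepted normalizes identifiers to NFKC rather than NFC.) A small illustration of why identifier-only normalization leaves visually identical string literals distinct:

    import unicodedata

    composed = '\u00e9tendue'        # 'étendue' with a precomposed é
    decomposed = 'e\u0301tendue'     # the same text in decomposed form
    print(composed == decomposed)    # False when compared as raw code points
    print(unicodedata.normalize('NFC', composed) ==
          unicodedata.normalize('NFC', decomposed))     # True after NFC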
On 6/3/07, Jim Jewett <[EMAIL PROTECTED]> wrote:
> On 6/2/07, Rauli Ruohonen <[EMAIL PROTECTED]> wrote:
> > # identifier_charset: 0-7f
>
> Why not ASCII?
> Why not be more specific, with 0x30-0x39, 0x41-0x5a, 0x5f, 0x61-0x7a
>
> When adding characters, this isn
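The explicit ranges quoted above (0x30-0x39, 0x41-0x5a, 0x5f, 0x61-0x7a) are just the classic ASCII identifier set; as a regular expression, roughly:

    import re

    ascii_identifier = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')
    print(bool(ascii_identifier.match('spam_42')))    # True
    print(bool(ascii_identifier.match('caf\u00e9')))  # False: é falls outside 0x00-0x7f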
On 6/2/07, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> Whether or not there exists a tool to convert from Python 2.6 to
> Python 3.0 (2to3), every tool that currently handles Python source
> code encodings via the method specified in the documentation
> (just about every Python-centric editor I kno
On 6/2/07, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> """
> If a comment in the first or second line of the Python script matches
> the regular expression coding[=:]\s*([-\w.]+), this comment is processed
> as an encoding declaration; the first group of this expression names the
> encoding of the
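The documented pattern is easy to try out directly; the sample line below is just an example of a PEP 263 coding cookie:

    import re

    coding_re = re.compile(r'coding[=:]\s*([-\w.]+)')
    line = '# -*- coding: iso-8859-15 -*-'
    match = coding_re.search(line)
    print(match.group(1))       # 'iso-8859-15'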
On 5/27/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> James Y Knight writes:
> > a 'pyidchar.txt' file with a list of character ranges, and now that
> > pyidchar.txt file is going to have separate sections based on module
> > name? Sorry, but are you [EMAIL PROTECTED] kidding me?!?
>
> The scalab