On Tue, Jun 30, 2020 at 09:04:15PM +0300, Mikhail V wrote:

> > Counter-proposal: hex escapes allow optional curly brackets, similar to
> > unicode name escapes. You could even allow spaces within the braces, for
> > grouping:
> >
> >     # Proposed enhancement:
> >     "\x{2b}2c"  # '+2c'
> >     "\x{2b2c}"  # '+,'
> >     "\x{DEAD BEEF}"  # "\xDE\xAD\xBE\xEF"
> 
> Nice. But I am not sure about the data type and interpretation depending
> on string type. E.g. the second example:
> 
>      "\x{2b2c}"  # '+,'
> 
> In my example I was showing hex codepoints, e.g.  U+2b2c is  ⬬ (Black
> Horizontal Ellipse)

Your example used the `\x` escape, which takes a pair of hex digits 
encoding a value between 0 and 255 inclusive (`\x00` to `\xFF`) and 
produces a single Unicode character between `\u0000` and `\u00FF`. You 
cannot use x escapes to build up higher Unicode code points in a string:

    '\x2b\x2c' != '\u2b2c'

So I assumed that you wanted a way to include multiple such escapes in a 
sequence. If you want the horizontal ellipse, don't use an `\x` escape, 
it is the wrong one! Use `\u2b2c`.
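
To make the difference concrete, here is a short check that runs on 
current Python 3 with no new syntax. The variable names are only for 
illustration.

```python
# Each \x escape yields exactly one character in the range U+0000..U+00FF,
# so two of them give a two-character string.
two_chars = '\x2b\x2c'

# A \u escape consumes four hex digits and yields one code point.
one_char = '\u2b2c'

print(len(two_chars), len(one_char))
print(two_chars == '+,')
print(one_char == '\N{BLACK HORIZONTAL ELLIPSE}')
print(two_chars != one_char)
```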

I have no interest in making `\x{2b2c}` an alternative way of writing 
`\u2b2c`. Just use the u (or U) escape instead of x.

I have no objection to adding the same braces to unicode u and U 
escapes. Inside the braces, spaces and underscores are simply ignored 
(they are there for visual grouping). So the proposal becomes:

(1) Byte strings support optional braces, spaces and underscores for 
grouping in hex escapes:

    b'\x{2b 2c_2a}' == b'\x2b\x2c\x2a' == b'+,*'

The spaces/underscores can appear anywhere within the braces, in any 
order. The "consenting adults" principle applies:

    # Valid, but don't do this.
    b'\x{      2      ___ _ ___       b     }'

Style guides and linters can warn against writing ugly strings :-)
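
Since none of this is valid syntax today, the intended semantics can 
only be approximated at run time. The following is a rough sketch, not 
a reference implementation: `expand_braced_hex` is a hypothetical 
helper, it works on raw bytes so the current compiler does not reject 
the braces, and rejecting an empty or odd-length digit run is my 
assumption, since the proposal does not spell that case out.

```python
import re

# Hypothetical helper, NOT an existing Python feature or API.
# Matches a backslash, 'x', and a braced run of hex digits, spaces
# and underscores.
_BRACED_HEX = re.compile(rb'\\x\{([0-9A-Fa-f _]*)\}')


def expand_braced_hex(raw: bytes) -> bytes:
    # Replace every braced hex group in `raw` with the bytes it denotes.
    # Anything outside a braced group is passed through unchanged.
    def decode_group(match):
        # Spaces and underscores are visual grouping only, so drop them.
        digits = match.group(1).replace(b' ', b'').replace(b'_', b'')
        # Assumption: each byte needs exactly two hex digits.
        if not digits or len(digits) % 2:
            raise ValueError(
                f'need a non-empty, even number of hex digits: {match.group(0)!r}'
            )
        return bytes.fromhex(digits.decode('ascii'))

    return _BRACED_HEX.sub(decode_group, raw)


# Raw literals keep the compiler from interpreting the escapes itself.
print(expand_braced_hex(rb'\x{2b 2c_2a}'))
print(expand_braced_hex(rb'\x{2b}2c'))
print(expand_braced_hex(rb'\x{DEAD BEEF}'))
```

The same approach would carry over to text strings by mapping each 
pair of digits through `chr()` instead of building bytes.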


(2) Unicode strings support the same, with the equivalent semantics:

    '\x{2b 2c_2a}' == '\x2b\x2c\x2a' == '+,*'


(3) Similarly Unicode strings support optional braces and grouping for u 
and U escapes:

    '\u{2b 2c}' == '\u2b2c' == '\N{BLACK HORIZONTAL ELLIPSE}'
    '\U{0000 2b2c}' == '\U00002b2c' == '\N{BLACK HORIZONTAL ELLIPSE}'

Likewise, any combination of spaces and underscores, in any order, is 
valid. We can write hideous strings if we want :-)

    # Valid but don't do this.
    '\U{  __ 0 __0__   0 0 2_b  2 ___c___ }'


Unlike x escapes, I don't think we should support multiple code points 
within the u and U braces:

    # Not part of the proposal
    '\u{221a221e}' == '\N{SQUARE ROOT}\N{INFINITY}'

My reasoning for this is that the leading `\x` is proportionally very 
"heavy" for hex escapes: 50% of the escape code is taken up by the 
leading `\x`, versus just 33% for u escapes and 20% for U escapes. 
So there is much less benefit to grouping multiple u and U escapes in a 
single set of braces.

The other reason why grouping u and U escapes is less useful is that 
often we can just include the literal unicode character as a string:

    '√∞'

whereas you cannot do so for control characters. So my argument is to 
make the conservative change and only allow multiple escape codes inside 
braces for x escapes.

(We can relax the restriction later if there is demand for it, but we 
cannot tighten it if we change our mind.)


Likewise, I would prefer the conservative approach of still requiring 
leading zeroes in u and U escapes.
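
Both restrictions fall out of a single rule: after the separators are 
stripped, a braced u escape must contain exactly four hex digits and a 
braced U escape exactly eight. Here is a rough sketch of that rule. 
`expand_braced_unicode` is a hypothetical name, and the function is 
only a run-time approximation on raw strings, not how the compiler 
would implement it.

```python
import re

# Hypothetical helper, NOT an existing Python feature or API.
_BRACED_U = re.compile(r'\\(u|U)\{([0-9A-Fa-f _]*)\}')
_WIDTH = {'u': 4, 'U': 8}  # same digit counts as the unbraced escapes


def expand_braced_unicode(raw: str) -> str:
    # Replace every braced u/U group in `raw` with its single character.
    def decode_group(match):
        kind = match.group(1)
        # Spaces and underscores are visual grouping only, so drop them.
        digits = match.group(2).replace(' ', '').replace('_', '')
        # Requiring the full width keeps leading zeroes mandatory and
        # also rules out packing several code points into one brace.
        if len(digits) != _WIDTH[kind]:
            raise ValueError(
                f'{match.group(0)!r} needs exactly {_WIDTH[kind]} hex digits'
            )
        return chr(int(digits, 16))

    return _BRACED_U.sub(decode_group, raw)


print(expand_braced_unicode(r'\u{2b 2c}') == '\u2b2c')
print(expand_braced_unicode(r'\U{0000 2b2c}') == '\U00002b2c')
```

With this rule, `\u{221a221e}` is rejected because it holds eight 
digits rather than four, and `\u{e9}` is rejected because the leading 
zeroes are missing.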


(4) Lastly, f-strings support the same rules as unicode strings.



-- 
Steven
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/473YKBKZMOH2FNMNDUOMD263VEJ3HH66/
Code of Conduct: http://python.org/psf/codeofconduct/
