On 2017-06-03 19:50, Thomas Jollans wrote:
On 03/06/17 18:48, Steven D'Aprano wrote:
On Sun, Jun 04, 2017 at 02:36:50AM +1000, Steven D'Aprano wrote:

But Python 3.5 does treat it as an identifier!

py> ℘ = 1  # should be a SyntaxError ?
py> ℘
1

There's a bug here, somewhere, I'm just not sure where...

That appears to be the only Symbol, Math (Sm) character which is accepted as an identifier in Python 3.5:

py> import unicodedata
py> all_unicode = map(chr, range(0x110000))
py> symbols = [c for c in all_unicode if unicodedata.category(c) == 'Sm']
py> len(symbols)
948
py> ns = {}
py> for c in symbols:
...     try:
...         exec(c + " = 1", ns)
...     except SyntaxError:
...         pass
...     else:
...         print(c, unicodedata.name(c))
...
℘ SCRIPT CAPITAL P
py>

This is actually not a bug in Python, but a quirk in Unicode.

I've had a closer look at PEP 3131 [1], which specifies that Python
identifiers follow the Unicode classes XID_Start and XID_Continue. ℘ is
listed in the standard [2][3] as XID_Start, so Python correctly accepts
it as an identifier.
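
To separate "may start an identifier" from "may only continue one", here is a minimal sketch. It uses str.isidentifier() as a stand-in for the XID_Start / XID_Continue tables, since the standard library does not expose those properties directly. The helper name describe() is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (category, can_start, can_continue) for one character.

    can_start:    ch alone is a valid identifier (approximates XID_Start)
    can_continue: ch is valid after a leading letter (approximates XID_Continue)
    """
    category = unicodedata.category(ch)
    can_start = ch.isidentifier()
    can_continue = ("a" + ch).isidentifier()
    return category, can_start, can_continue

# U+2118 and U+212E are the two symbols discussed in this thread;
# U+2211 (N-ARY SUMMATION) and the digit 1 are included for contrast.
for ch in "\u2118\u212e\u22111":
    print("U+%04X" % ord(ch), unicodedata.name(ch), describe(ch))
```

The exact categories printed depend on the Unicode version bundled with the interpreter.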

>>> import unicodedata
>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     # neither letter nor letter-number
...     if not category.startswith('L') and category != 'Nl':
...         if c.isidentifier():
...             print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
...
_    U+005F    LOW LINE
℘    U+2118    SCRIPT CAPITAL P
℮    U+212E    ESTIMATED SYMBOL


℘ and ℮ are actually explicitly mentioned in the Unicode annex [3]:


      2.5 Backward Compatibility

Unicode General_Category values are kept as stable as possible, but
they can change across versions of the Unicode Standard. The bulk of
the characters having a given value are determined by other
properties, and the coverage expands in the future according to the
assignment of those properties. In addition, the Other_ID_Start
property provides a small list of characters that qualified as
ID_Start characters in some previous version of Unicode solely on the
basis of their General_Category properties, but that no longer qualify
in the current version. These are called /grandfathered/ characters.

The Other_ID_Start property includes characters such as the following:

    U+2118 ( ℘ ) SCRIPT CAPITAL P
    U+212E ( ℮ ) ESTIMATED SYMBOL
    U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK
    U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK


I have no idea why U+309B and U+309C are not accepted as identifiers by
Python 3.5. This could be a question of Python following an old version
of the Unicode standard, or it *could* be a bug.

[snip]

U+309B and U+309C have had the property ID_Start since at least Unicode 6.0 (August 2010).
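
One thing that may be worth checking here is how the two characters behave under NFKC normalization. PEP 3131 normalizes identifiers to NFKC and uses the XID_* properties, which are derived from ID_* with that normalization in mind. The following is only a sketch for inspecting the characters, not a confirmed explanation of the behaviour. The helper name nfkc_codepoints() is made up for illustration.

```python
import unicodedata

def nfkc_codepoints(ch):
    """Return the code points that `ch` becomes under NFKC normalization."""
    return [ord(c) for c in unicodedata.normalize("NFKC", ch)]

for ch in ("\u309b", "\u309c"):
    print(
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        "isidentifier=%s" % ch.isidentifier(),
        "NFKC=" + " ".join("U+%04X" % cp for cp in nfkc_codepoints(ch)),
    )
```

If the NFKC form of a character begins with something that cannot start an identifier, that would be consistent with the character carrying ID_Start but not XID_Start.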

Interestingly, '_' doesn't have that property, although Python does allow identifiers to start with it.
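
The underscore case follows from PEP 3131 itself, which lists the underscore explicitly in its definition of ID_Start, alongside the letter categories, Nl, and Other_ID_Start. A rough sketch of that classification is below. It approximates the PEP's wording with general categories and str.isidentifier(), so it is not the interpreter's actual lookup. The name why_can_start() is hypothetical.

```python
import unicodedata

# General categories that PEP 3131 lists for ID_Start.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def why_can_start(ch):
    """Roughly classify why `ch` may (or may not) start a Python identifier."""
    if not ch.isidentifier():
        return "cannot start an identifier"
    if ch == "_":
        return "underscore (listed explicitly in PEP 3131)"
    category = unicodedata.category(ch)
    if category in ID_START_CATEGORIES:
        return "general category " + category
    # Anything left over is allowed for some other reason,
    # for example the Other_ID_Start property.
    return "other (for example Other_ID_Start)"

for ch in ("_", "a", "\u2118", "\u212e", "1"):
    print(repr(ch), "->", why_can_start(ch))
```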
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
