On 2017-06-03 19:50, Thomas Jollans wrote:
On 03/06/17 18:48, Steven D'Aprano wrote:
On Sun, Jun 04, 2017 at 02:36:50AM +1000, Steven D'Aprano wrote:
But Python 3.5 does treat it as an identifier!
py> ℘ = 1 # should be a SyntaxError ?
py> ℘
1
There's a bug here, somewhere, I'm just not sure where...
That appears to be the only Symbol Math character which is accepted as
an identifier in Python 3.5:
py> import unicodedata
py> all_unicode = map(chr, range(0x110000))
py> symbols = [c for c in all_unicode if unicodedata.category(c) == 'Sm']
py> len(symbols)
948
py> ns = {}
py> for c in symbols:
...     try:
...         exec(c + " = 1", ns)
...     except SyntaxError:
...         pass
...     else:
...         print(c, unicodedata.name(c))
...
℘ SCRIPT CAPITAL P
py>
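The same scan can be done without exec(), as a rough sketch. str.isidentifier() applies the identifier rules to a string, so a one-character string passes only if that character can start an identifier. The helper name identifier_chars_in_category is made up for illustration and is not part of any library.

```python
import sys
import unicodedata


def identifier_chars_in_category(category):
    """Return the characters of one General_Category that Python accepts
    as a one-character identifier.

    Hypothetical helper, written only for this illustration.
    """
    return [
        c for c in map(chr, range(sys.maxunicode + 1))
        if unicodedata.category(c) == category and c.isidentifier()
    ]


# 'Sm' is the General_Category "Symbol, Math" used in the session above.
math_identifiers = identifier_chars_in_category('Sm')
for c in math_identifiers:
    print('U+%04X' % ord(c), unicodedata.name(c))
```

The exact list depends on the Unicode version bundled with the interpreter, so the output may differ between Python releases.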
This is actually not a bug in Python, but a quirk in Unicode.
I've had a closer look at PEP 3131 [1], which specifies that Python
identifiers follow the Unicode classes XID_Start and XID_Continue. ℘ is
listed in the standard [2][3] as XID_Start, so Python correctly accepts
it as an identifier.
>>> import unicodedata
>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     # neither letter nor letter-number
...     if not category.startswith('L') and category != 'Nl':
...         if c.isidentifier():
...             print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
...
_ U+005F LOW LINE
℘ U+2118 SCRIPT CAPITAL P
℮ U+212E ESTIMATED SYMBOL
℘ and ℮ are actually explicitly mentioned in the Unicode annex [3]:
2.5 Backward Compatibility
Unicode General_Category values are kept as stable as possible, but
they can change across versions of the Unicode Standard. The bulk of
the characters having a given value are determined by other
properties, and the coverage expands in the future according to the
assignment of those properties. In addition, the Other_ID_Start
property provides a small list of characters that qualified as
ID_Start characters in some previous version of Unicode solely on the
basis of their General_Category properties, but that no longer qualify
in the current version. These are called /grandfathered/ characters.
The Other_ID_Start property includes characters such as the following:
U+2118 ( ℘ ) SCRIPT CAPITAL P
U+212E ( ℮ ) ESTIMATED SYMBOL
U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
I have no idea why U+309B and U+309C are not accepted as identifiers by
Python 3.5. This could be a question of Python following an old version
of the Unicode standard, or it *could* be a bug.
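One possible explanation, offered as an assumption rather than something established in this thread: PEP 3131 is based on XID_Start rather than ID_Start, and XID_Start is ID_Start adjusted for closure under NFKC normalization. The sketch below only reports what the standard library says about the four characters listed above; report is a made-up helper name.

```python
import unicodedata

# The four Other_ID_Start characters quoted from the annex.
CANDIDATES = ['\u2118', '\u212e', '\u309b', '\u309c']


def report(c):
    """Return (accepted as identifier?, code points of the NFKC form).

    Hypothetical helper, written only for this illustration.
    """
    nfkc = unicodedata.normalize('NFKC', c)
    return (c.isidentifier(), ['U+%04X' % ord(x) for x in nfkc])


for c in CANDIDATES:
    accepted, nfkc_points = report(c)
    print('U+%04X' % ord(c), unicodedata.name(c), accepted, nfkc_points)
```

If the two sound marks normalize to a space followed by a combining mark, that would account for isidentifier() rejecting them even though they are listed under Other_ID_Start.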
[snip]
U+309B and U+309C have had the property ID_Start since at least Unicode
6.0 (August 2010).
Interestingly, '_' doesn't have that property, although Python does
allow identifiers to start with it.
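A quick check of the underscore case, as a small sketch: the General_Category of '_' is not a letter category, yet str.isidentifier() accepts it, which is consistent with Python treating it as a special case. The variable names are only for illustration.

```python
import unicodedata

underscore = '_'
# General_Category of LOW LINE (connector punctuation, not a letter).
category = unicodedata.category(underscore)
print(category, underscore.isidentifier())

# An identifier may also consist of underscores only.
namespace = {}
exec('__ = 1', namespace)
print(namespace['__'])
```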
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/