On 03/06/17 18:48, Steven D'Aprano wrote:
> On Sun, Jun 04, 2017 at 02:36:50AM +1000, Steven D'Aprano wrote:
>
>> But Python 3.5 does treat it as an identifier!
>>
>> py> ℘ = 1  # should be a SyntaxError ?
>> py> ℘
>> 1
>>
>> There's a bug here, somewhere, I'm just not sure where...
> That appears to be the only Symbol Math character which is accepted as
> an identifier in Python 3.5:
>
> py> import unicodedata
> py> all_unicode = map(chr, range(0x110000))
> py> symbols = [c for c in all_unicode if unicodedata.category(c) == 'Sm']
> py> len(symbols)
> 948
> py> ns = {}
> py> for c in symbols:
> ...     try:
> ...         exec(c + " = 1", ns)
> ...     except SyntaxError:
> ...         pass
> ...     else:
> ...         print(c, unicodedata.name(c))
> ...
> ℘ SCRIPT CAPITAL P
> py>

This is actually not a bug in Python, but a quirk in Unicode.

I've had a closer look at PEP 3131 [1], which specifies that Python
identifiers follow the Unicode classes XID_Start and XID_Continue. ℘ is
listed in the standard [2][3] as XID_Start, so Python correctly accepts
it as an identifier.

>>> import unicodedata
>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     if not category.startswith('L') and category != 'Nl':  # neither letter nor letter-number
...         if c.isidentifier():
...             print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
...
_       U+005F  LOW LINE
℘       U+2118  SCRIPT CAPITAL P
℮       U+212E  ESTIMATED SYMBOL
>>>

℘ and ℮ are actually explicitly mentioned in the Unicode annex [3]:

>
> 2.5 Backward Compatibility
>
> Unicode General_Category values are kept as stable as possible, but
> they can change across versions of the Unicode Standard. The bulk of
> the characters having a given value are determined by other
> properties, and the coverage expands in the future according to the
> assignment of those properties. In addition, the Other_ID_Start
> property provides a small list of characters that qualified as
> ID_Start characters in some previous version of Unicode solely on the
> basis of their General_Category properties, but that no longer qualify
> in the current version. These are called /grandfathered/ characters.
>
> The Other_ID_Start property includes characters such as the following:
>
> U+2118 ( ℘ ) SCRIPT CAPITAL P
> U+212E ( ℮ ) ESTIMATED SYMBOL
> U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK
> U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
>

I have no idea why U+309B and U+309C are not accepted as identifiers by
Python 3.5. This could be a question of Python following an old version
of the Unicode standard, or it *could* be a bug.
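
One thing that might be worth checking for U+309B and U+309C, offered as a
guess rather than a confirmed explanation: PEP 3131 [1] describes XID_Start
as ID_Start with those characters removed whose NFKC normalization is no
longer of the form ID_Start ID_Continue*. The sketch below prints, for the
four characters quoted above, what the running interpreter reports for
str.isidentifier() next to the NFKC form of each character. The names
CANDIDATES and describe are made up for this illustration.

```python
import unicodedata

# The four characters quoted from the Other_ID_Start list in UAX #31.
CANDIDATES = ["\u2118", "\u212E", "\u309B", "\u309C"]


def describe(ch):
    """Return (code point, name, isidentifier() result, NFKC form as code points)."""
    nfkc = unicodedata.normalize("NFKC", ch)
    return (
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        ch.isidentifier(),
        " ".join("U+%04X" % ord(c) for c in nfkc),
    )


rows = [describe(ch) for ch in CANDIDATES]
for row in rows:
    print("%s\t%-40s\tidentifier=%s\tNFKC=%s" % row)
```

If the NFKC form of a character starts with something that cannot begin an
identifier (a space, say), that would be consistent with the character being
in ID_Start but not in XID_Start. This only inspects interpreter behaviour;
it does not establish which Unicode version or rule CPython applies.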

Thomas

[1] https://www.python.org/dev/peps/pep-3131/#specification-of-language-changes
[2] http://www.unicode.org/Public/4.1.0/ucd/DerivedCoreProperties.txt
[3] http://www.unicode.org/reports/tr31/
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/