On 03/06/17 21:02, Thomas Jollans wrote:
> On 03/06/17 20:41, Chris Angelico wrote:
>> [snip]
>> For reference, as well as the 948 Sm, there are 1690 Mn and 5777 So,
>> but only these characters are valid from them:
>>
>> \u1885 Mn MONGOLIAN LETTER ALI GALI BALUDA
>> \u1886 Mn MONGOLIAN LETTER ALI GALI THREE BALUDA
>> ℘ Sm SCRIPT CAPITAL P
>> ℮ So ESTIMATED SYMBOL
>>
>> 2118 SCRIPT CAPITAL P and 212E ESTIMATED SYMBOL are listed in
>> PropList.txt as Other_ID_Start, so they make sense. But that doesn't
>> explain the two characters from category Mn. It also doesn't explain
>> why U+309B and U+309C are *not* valid, despite being declared
>> Other_ID_Start. Maybe it's a bug? Maybe 309B and 309C somehow got
>> switched into 1885 and 1886??
> \u1885 and \u1886 are categorised as letters (category Lo) by my Python
> 3.5. (Which makes sense, right?) If your system puts them in category
> Mn, that's bound to be a bug somewhere.

Actually it turns out that these characters were changed to category Mn
in Unicode 9.0, but remain in (X)ID_Start for compatibility. All is
right with the world. (All of this just goes to show how much subtlety
there is in the science that goes into making Unicode.)

See: http://www.unicode.org/reports/tr44/tr44-18.html#Unicode_9.0.0

>
> As for \u309B and \u309C - it turns out this is a question of
> normalisation. PEP 3131 requires NFKC normalisation:
>
> >>> for c in unicodedata.normalize('NFKC', '\u309B'):
> ...     print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
> ...
>       U+0020  SPACE
>       U+3099  COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
> >>> for c in unicodedata.normalize('NFKC', '\u309C'):
> ...     print('%s\tU+%04X\t%s' % (c, ord(c), unicodedata.name(c)))
> ...
>       U+0020  SPACE
>       U+309A  COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
>
> This is.... interesting.
>
>
> Thomas
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas@python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/

-- 
Thomas Jollans

m ☎ +31 6 42630259
e ✉ t...@tjol.eu
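
For anyone who wants to reproduce the checks discussed in this thread, here is a minimal sketch using only the standard `unicodedata` module. It assumes `str.isidentifier()` is an acceptable stand-in for the "valid" test used earlier in the thread, and the helper name `nfkc_breakdown` is made up for illustration. The category reported for U+1885 and U+1886 depends on the Unicode database bundled with your interpreter (Lo before Unicode 9.0, Mn from 9.0 on), so no expected output is shown.

```python
import unicodedata


def nfkc_breakdown(ch):
    """Return ('U+XXXX', name) pairs for the NFKC-normalised form of ch.

    Hypothetical helper, not part of any library.
    """
    return [
        ('U+%04X' % ord(c), unicodedata.name(c, '<unnamed>'))
        for c in unicodedata.normalize('NFKC', ch)
    ]


# The six characters discussed in the thread.
for ch in '\u1885\u1886\u2118\u212E\u309B\u309C':
    print('U+%04X\t%s\tidentifier=%s\tNFKC=%s' % (
        ord(ch),
        unicodedata.category(ch),   # varies with the bundled Unicode version
        ch.isidentifier(),          # can this character stand alone as a name?
        nfkc_breakdown(ch),
    ))
```

Running this on different Python versions side by side (and printing `unicodedata.unidata_version`) should make the Lo/Mn difference for the two Mongolian letters visible.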