On 12/15/2016 11:53 AM, Steve D'Aprano wrote:
Suppose I have a Unicode character, and I want to determine the script or
scripts it belongs to.
For example:
U+0033 DIGIT THREE "3" belongs to the script "COMMON";
U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".
Is this information available from Python?
Yes, though not as nicely as you probably want. (Have you searched for
existing 3rd party modules?) As near as I can tell, there is no direct
'script' property in the unicodedata module.
Option 1: unicodedata module, from char name
>>> import unicodedata as ucd
>>> ucd.name('\u03be')
'GREEK SMALL LETTER XI'
>>> ucd.name('\u0061')
'LATIN SMALL LETTER A'
In most cases, the non-common char names start with a script name.
In some cases, the script name is 2 or 3 words.
>>> ucd.name('\U00010A60')
'OLD SOUTH ARABIAN LETTER HE'
In a few cases, the script name is embedded in the name.
>>> ucd.name('\U0001F200')
'SQUARE HIRAGANA HOKA'
Occasionally the script name is omitted.
>>> ucd.name('\u3300')
'SQUARE APAATO' # Katakana
Too bad the Unicode Consortium did not use a consistent naming scheme:
script [, subscript]: character
LATIN: SMALL LETTER A
ARABIAN, OLD SOUTH: LETTER HE
KATAKANA: SQUARE APAATO
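A rough sketch of the name-prefix approach from Option 1, to show where it works and where it falls over. The prefix tuple and the function name `guess_script_from_name` are made up for illustration, and the prefix list is deliberately incomplete. This is a heuristic only, not a real script lookup.

```python
import unicodedata

# Hypothetical, incomplete list of script-name prefixes (an assumption for
# this sketch; a real list would have to cover every script).
KNOWN_PREFIXES = (
    "OLD SOUTH ARABIAN",
    "LATIN",
    "GREEK",
    "CYRILLIC",
    "HIRAGANA",
    "KATAKANA",
)

def guess_script_from_name(char):
    """Return the script prefix of the character's name, or None."""
    name = unicodedata.name(char, "")
    for prefix in KNOWN_PREFIXES:
        if name.startswith(prefix + " "):
            return prefix
    # Names such as 'DIGIT THREE', 'SQUARE HIRAGANA HOKA' and
    # 'SQUARE APAATO' do not start with a script name and end up here.
    return None
```

With this, 'a' gives 'LATIN' and '\u03be' gives 'GREEK', but '\U0001F200' and '\u3300' both give None, which is why the name alone is not enough.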
More about Unicode scripts:
http://www.unicode.org/reports/tr24/
http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
Option 2: Fetch the above Scripts.txt.
Suboption 1: Turn Scripts.txt into a list of lines. The lines could be
condensed to codepoint or codepoint range, script. Write a function
that takes a character or codepoint and linearly scans the list for a
matching line. This makes each lookup O(number-of-lines).
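A minimal sketch of Suboption 1, assuming each data line has the form "codepoint-or-range ; Script # comment". The sample lines are hand-written in that format for illustration, not copied from the file, and the names `parse_script_ranges` and `script_of` are made up. I assume unlisted codepoints should report 'Unknown'.

```python
# Hand-written sample lines in the Scripts.txt format (not the real file).
SAMPLE = """\
# comment lines and blank lines are skipped

0030..0039    ; Common # digits
0061..007A    ; Latin # small letters
03BE          ; Greek # small letter xi
"""

def parse_script_ranges(text):
    """Condense the lines to a list of (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        codepoints, script = (field.strip() for field in line.split(";"))
        first, _, last = codepoints.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges

def script_of(char, ranges):
    """Linear scan: O(number-of-lines) per lookup."""
    cp = ord(char)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return "Unknown"  # assumed default for codepoints not listed

ranges = parse_script_ranges(SAMPLE)
```

For the real thing, one would read the downloaded Scripts.txt and pass its text to `parse_script_ranges` instead of SAMPLE.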
Suboption 2: Turn Scripts.txt into a list of scripts, with codepoint
being the index. This takes more preparation, but makes each lookup
O(1). Once the preparation is done, the list could be turned into a
tuple and saved as a .py file, with the tuple being a compiled constant
in a .pyc file.
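A sketch of Suboption 2, assuming the ranges have already been condensed to (first, last, script) tuples. The three ranges below are illustrative stand-ins for the parsed file, and `build_table` is a made-up name. The table covers every codepoint up to U+10FFFF, with 'Unknown' as an assumed default.

```python
# Illustrative stand-in for the ranges parsed from Scripts.txt.
RANGES = [
    (0x0030, 0x0039, "Common"),
    (0x0061, 0x007A, "Latin"),
    (0x03BE, 0x03BE, "Greek"),
]

def build_table(ranges, size=0x110000, default="Unknown"):
    """Return a tuple indexed by codepoint, giving the script name."""
    table = [default] * size
    for first, last, script in ranges:
        count = last - first + 1
        # Every slot in the range refers to the same string object.
        table[first:last + 1] = [script] * count
    return tuple(table)

SCRIPTS = build_table(RANGES)

def script_of(char):
    """O(1) lookup by indexing with the codepoint."""
    return SCRIPTS[ord(char)]
```

The preparation is slower than building the list of lines, but after that a lookup is a single index operation.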
To avoid bloat, make sure that multiple entries for a script use the
same string object instead of multiple equal strings. (CPython string
interning might do this automatically, but cross-implementation code
should not depend on this.) The difference is
scripts = [..., 'Han', 'Han', 'Han', ...] # multiple strings
versus
HAN = 'Han'
scripts = [..., HAN, HAN, HAN, ...] # multiple references to one string
On a 64 bit OS, the latter would use 8 bytes per defined codepoint
(there are about 200,000). Assuming such a module does not already
exist, it might be worth making one available on PyPI.
For characters that are used with more than one script, see
http://www.unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt
Essentially, ditto, except that I would use a dict rather than a
sequence as there are only about 400 codepoints involved.
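A sketch of the dict version, assuming each data line of ScriptExtensions.txt has the form "codepoint-or-range ; code code ... # comment", with space-separated short script codes. The sample lines are written by hand in that format and may not match the current file, and `parse_script_extensions` is a made-up name.

```python
# Hand-written sample lines in the assumed ScriptExtensions.txt format.
SAMPLE = """\
# comment lines and blank lines are skipped

3099..309A    ; Hira Kana # combining sound marks
30FC          ; Hira Kana # prolonged sound mark
"""

def parse_script_extensions(text):
    """Return a dict mapping codepoint -> tuple of script codes."""
    extensions = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        codepoints, scripts = (field.strip() for field in line.split(";"))
        first, _, last = codepoints.partition("..")
        value = tuple(scripts.split())  # one tuple shared by the range
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            extensions[cp] = value
    return extensions

EXTENSIONS = parse_script_extensions(SAMPLE)
```

A lookup is then `EXTENSIONS.get(ord(char))`, which returns None for the great majority of characters that have no entry.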
--
Terry Jan Reedy