On 12/15/2016 11:53 AM, Steve D'Aprano wrote:
Suppose I have a Unicode character, and I want to determine the script or
scripts it belongs to.

For example:

U+0033 DIGIT THREE "3" belongs to the script "COMMON";
U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".

Is this information available from Python?



Yes, though not as nicely as you probably want. (Have you searched for existing 3rd party modules?) As near as I can tell, there is no direct 'script' property in the unicodedata module.

Option 1: unicodedata module, from char name

>>> import unicodedata as ucd
>>> ucd.name('\u03be')
'GREEK SMALL LETTER XI'
>>> ucd.name('\u0061')
'LATIN SMALL LETTER A'

In most cases, the non-common char names start with a script name.
In some cases, the script name is 2 or 3 words.

>>> ucd.name('\U00010A60')
'OLD SOUTH ARABIAN LETTER HE'

In a few cases, the script name is embedded in the name.
>>> ucd.name('\U0001F200')
'SQUARE HIRAGANA HOKA'

Occasionally the script name is omitted.
>>> ucd.name('\u3300')
'SQUARE APAATO'  # Katakana

Too bad the Unicode Consortium did not use a consistent name scheme:
script [, subscript]: character

LATIN: SMALL LETTER A
ARABIAN, OLD SOUTH: LETTER HE
KATAKANA: SQUARE APAATO
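
A rough sketch of the name-based approach, with its limits visible. The function name guess_script_from_name and the short KNOWN_SCRIPTS tuple are made up for illustration; a real list of script names would have to come from somewhere like Scripts.txt.

```python
import unicodedata

# Hypothetical and deliberately incomplete; longest names first so that a
# multi-word script name wins over any shorter name it might contain.
KNOWN_SCRIPTS = sorted(
    ('OLD SOUTH ARABIAN', 'LATIN', 'GREEK', 'HIRAGANA', 'KATAKANA'),
    key=len, reverse=True)

def guess_script_from_name(char):
    """Guess a script from the character name; return None if no guess."""
    name = unicodedata.name(char, '')
    # Most common case: the name starts with the script name.
    for script in KNOWN_SCRIPTS:
        if name.startswith(script + ' '):
            return script
    # Less common case: the script name is embedded as whole words.
    padded = ' ' + name + ' '
    for script in KNOWN_SCRIPTS:
        if ' ' + script + ' ' in padded:
            return script
    # Script name omitted (e.g. SQUARE APAATO) or not in the list.
    return None
```

With this list, 'a' gives 'LATIN' and '\U0001F200' gives 'HIRAGANA', but '\u3300' gives None, which is exactly the weakness of relying on names.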

More about Unicode scripts:

http://www.unicode.org/reports/tr24/
http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt

Option 2: Fetch the above Scripts.txt.

Suboption 1: Turn Scripts.txt into a list of lines. The lines could be condensed to codepoint or codepoint range, script. Write a function that takes a character or codepoint and linearly scans the list for a matching line. This makes each lookup O(number-of-lines).
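
A minimal sketch of this suboption, assuming the file has already been fetched and read into lines. The names parse_ranges and script_of are made up, and SAMPLE_LINES is a tiny illustrative stand-in in the layout of Scripts.txt, not the real file. Codepoints not listed get 'Unknown', which is the default the UCD specifies for the Script property.

```python
# Illustrative stand-in for the lines of Scripts.txt.
SAMPLE_LINES = """\
# comment lines and blank lines are skipped

0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
03A3..03E1    ; Greek # L&  [63] GREEK CAPITAL LETTER SIGMA..GREEK LETTER SAMPI
""".splitlines()

def parse_ranges(lines):
    """Condense lines to a list of (first, last, script) tuples."""
    ranges = []
    for line in lines:
        data = line.split('#', 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        codes, script = (field.strip() for field in data.split(';'))
        first, _, last = codes.partition('..')
        # A single codepoint has no '..', so last is the empty string.
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges

def script_of(char, ranges, default='Unknown'):
    """Linear scan: O(number of lines) per lookup."""
    cp = ord(char)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return default

ranges = parse_ranges(SAMPLE_LINES)
```

For the real file, the lines would come from the downloaded Scripts.txt instead of SAMPLE_LINES.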

Suboption 2: Turn Scripts.txt into a list of scripts, with codepoint being the index. This takes more preparation, but makes each lookup O(1). Once the preparation is done, the list could be turned into a tuple and saved as a .py file, with the tuple being a compiled constant in a .pyc file.
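
A sketch of the table-building step, assuming the file has already been condensed to (first, last, script) tuples. The name build_table and the three sample ranges are made up for illustration. I size the table to cover every possible codepoint so that the index is simply ord(char); a shorter table with a bounds check would also work.

```python
import sys

# Hypothetical, already-parsed sample ranges: (first, last, script).
SAMPLE_RANGES = [
    (0x0030, 0x0039, 'Common'),
    (0x0061, 0x007A, 'Latin'),
    (0x03A3, 0x03E1, 'Greek'),
]

def build_table(ranges, default='Unknown'):
    """Return a tuple with one script name per possible codepoint."""
    # Every slot starts as a reference to the same default string.
    table = [default] * (sys.maxunicode + 1)
    for first, last, script in ranges:
        for cp in range(first, last + 1):
            table[cp] = script  # all slots of a range share one object
    return tuple(table)

table = build_table(SAMPLE_RANGES)

def script_of(char):
    """O(1) lookup by indexing with the codepoint."""
    return table[ord(char)]
```

Preparation walks every listed codepoint once; after that each lookup is a single index operation.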

To avoid bloat, make sure that multiple entries for a script use the same string object instead of multiple equal strings. (CPython string interning might do this automatically, but cross-implementation code should not depend on this.) The difference is

scripts = [..., 'Han', 'Han', 'Han', ...] # multiple strings
versus
HAN = 'Han'
scripts = [..., HAN, HAN, HAN, ...]  # multiple references to one string

On a 64-bit OS, the latter would use 8 bytes x the number of defined codepoints (about 200,000). Assuming such a module does not already exist, it might be worth making one available on PyPI.
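
One way to get the shared-string form without writing a constant for each script by hand is to pass every name through a dict that maps each name to the first copy seen. This is only a sketch; the helper name share_strings is made up.

```python
def share_strings(names):
    """Return a list in which equal strings are the same object."""
    canonical = {}
    # setdefault returns the first-seen copy for each distinct value,
    # so later equal strings are replaced by references to that copy.
    return [canonical.setdefault(name, name) for name in names]

# Build two equal strings at run time so that they are separate objects
# rather than one shared compile-time constant.
first = ''.join(['H', 'an'])
second = ''.join(['Ha', 'n'])

shared = share_strings([first, second, 'Latin'])
same_object = shared[0] is shared[1]
```

Since this relies only on dict behavior, it should not depend on what any particular implementation does about interning.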

http://www.unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt

Essentially, ditto, except that I would use a dict rather than a sequence as there are only about 400 codepoints involved.
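
A sketch of the dict version, assuming each data line has the layout 'codepoint-or-range ; script codes # comment', with the codes separated by spaces. The name parse_extensions is made up, and the sample lines are illustrative stand-ins, not copied from the real file.

```python
# Illustrative stand-in for lines of ScriptExtensions.txt.
SAMPLE_LINES = """\
# comment lines and blank lines are skipped
0640          ; Arab Syrc # illustrative single codepoint
1CD0..1CD2    ; Beng Deva # illustrative range
""".splitlines()

def parse_extensions(lines):
    """Map each listed codepoint to a tuple of script codes."""
    extensions = {}
    for line in lines:
        data = line.split('#', 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        codes, scripts = (field.strip() for field in data.split(';'))
        first, _, last = codes.partition('..')
        value = tuple(scripts.split())  # one tuple shared by the range
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            extensions[cp] = value
    return extensions

extensions = parse_extensions(SAMPLE_LINES)
```

A lookup is then extensions.get(ord(char)), with None meaning the character has no entry in the file.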

--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list