Re: Unicode script

Terry Reedy Thu, 15 Dec 2016 11:49:12 -0800

On 12/15/2016 11:53 AM, Steve D'Aprano wrote:

Suppose I have a Unicode character, and I want to determine the script or
scripts it belongs to.


For example:

U+0033 DIGIT THREE "3" belongs to the script "COMMON";
U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".

Is this information available from Python?

Yes, though not as nicely as you probably want. (Have you searched for existing 3rd party modules?) As near as I can tell, there is no direct 'script' property in the unicodedata database.


Option 1: unicodedata module, from char name

>>> import unicodedata as ucd
>>> ucd.name('\u03be')
'GREEK SMALL LETTER XI'
>>> ucd.name('\u0061')
'LATIN SMALL LETTER A'

In most cases, the non-common char names start with a script name.
In some cases, the script name is 2 or 3 words.

>>> ucd.name('\U00010A60')
'OLD SOUTH ARABIAN LETTER HE'

In a few cases, the script name is embedded in the name.
>>> ucd.name('\U0001F200')
'SQUARE HIRAGANA HOKA'

Occasionally the script name is omitted.
>>> ucd.name('\u3300')
'SQUARE APAATO'  # Katakana

Too bad the Unicode Consortium did not use a consistent name scheme:
script [, subscript]: character

LATIN: SMALL LETTER A
ARABIAN, OLD SOUTH: LETTER HE
KATAKANA: SQUARE APAATO
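
As a rough sketch of Option 1, something like the following could guess a script from the start of the character name. The function name guess_script and the KNOWN_SCRIPTS tuple are made up for illustration, and the tuple is only a tiny subset; a real list would have to come from somewhere else (for instance Scripts.txt, below). It fails for names that embed or omit the script.

```python
import unicodedata

# Hypothetical, deliberately incomplete list of script names.
# Multi-word names come first so they are tried before shorter ones.
KNOWN_SCRIPTS = ('OLD SOUTH ARABIAN', 'LATIN', 'GREEK')

def guess_script(char):
    """Guess the script from the start of the character name, or None."""
    name = unicodedata.name(char, '')   # '' for unnamed characters
    for script in KNOWN_SCRIPTS:
        if name.startswith(script + ' '):
            return script
    return None   # common characters, embedded or omitted script names
```

With this, guess_script('a') gives 'LATIN', while guess_script('\u3300') gives None, because 'SQUARE APAATO' does not name its script.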

More about Unicode scripts:

http://www.unicode.org/reports/tr24/
http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt


Option 2: Fetch the above Scripts.txt.

Suboption 1: Turn Scripts.txt into a list of lines. The lines could be condensed to codepoint or codepoint range, script. Write a function that takes a character or codepoint and linearly scans the list for a matching line. This makes each lookup O(number-of-lines).
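
A minimal sketch of Suboption 1, assuming the "codepoint-or-range ; Script # comment" line layout of Scripts.txt. The names parse_scripts and script_of are invented here, and SAMPLE is a few simplified, hand-written lines in that layout rather than the real file, which you would read from a downloaded copy instead.

```python
# Simplified lines in the layout of Scripts.txt (not the real file).
SAMPLE = """\
# comment lines and blank lines are skipped

0030..0039    ; Common # DIGIT ZERO..DIGIT NINE
0061..007A    ; Latin # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
03B1..03C9    ; Greek # simplified range
10A60..10A7C  ; Old_South_Arabian # simplified range
"""

def parse_scripts(text):
    """Condense the text to a list of (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # drop trailing comment
        if not line:
            continue
        codes, script = (field.strip() for field in line.split(';'))
        first, _, last = codes.partition('..')
        # A single codepoint is treated as a range of length one.
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges

def script_of(char, ranges, default='Unknown'):
    """Linear scan: O(number-of-lines) per lookup."""
    cp = ord(char)
    for first, last, script in ranges:
        if first <= cp <= last:
            return script
    return default

RANGES = parse_scripts(SAMPLE)
```

Here script_of('3', RANGES) returns 'Common' and script_of('\u03be', RANGES) returns 'Greek'; anything outside the sample falls back to the default.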

Suboption 2: Turn Scripts.txt into a list of scripts, with codepoint being the index. This takes more preparation, but makes each lookup O(1). Once the preparation is done, the list could be turned into a tuple and saved as a .py file, with the tuple being a compiled constant in a .pyc file.

To avoid bloat, make sure that multiple entries for a script use the same string object instead of multiple equal strings. (CPython string interning might do this automatically, but cross-implementation code should not depend on this.) The difference is


scripts = [..., 'Han', 'Han', 'Han', ...] # multiple strings
versus
HAN = 'Han'
scripts = [..., HAN, HAN, HAN, ...]  # multiple references to one string

On a 64 bit OS, the latter would use 8 x defined codepoints (about 200,000) bytes. Assuming such does not already exist, it might be worth making such a module available on PyPI.
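
Suboption 2, with the shared strings, might be sketched as follows. The name build_table and the RANGES data are invented for illustration (the ranges are simplified, not copied from Scripts.txt). For simplicity the sketch allocates a slot for every codepoint up to U+10FFFF, not just the defined ones, so it is larger than the estimate above.

```python
# Hypothetical (first, last, script) tuples, e.g. parsed from Scripts.txt.
RANGES = [
    (0x0030, 0x0039, 'Common'),
    (0x0061, 0x007A, 'Latin'),
    (0x3400, 0x4DB5, 'Han'),
    (0x4E00, 0x9FD5, 'Han'),
]

def build_table(ranges, default='Unknown'):
    """Return a tuple mapping codepoint -> script name, for O(1) lookup."""
    shared = {}                        # one string object per script name
    table = [default] * 0x110000       # every slot starts as the default
    for first, last, script in ranges:
        # Reuse the first object seen for this name instead of relying
        # on the implementation to intern equal strings.
        script = shared.setdefault(script, script)
        for cp in range(first, last + 1):
            table[cp] = script
    return tuple(table)

TABLE = build_table(RANGES)
```

A lookup is then just TABLE[ord(char)], and all the 'Han' entries refer to one string object.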

http://www.unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt

Essentially, ditto, except that I would use a dict rather than a sequence as there are only about 400 codepoints involved.
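
A possible sketch of the dict approach, assuming ScriptExtensions.txt uses the same "codepoint-or-range ; value # comment" layout with the value being a space-separated list of script codes. The name parse_extensions is invented, and SAMPLE is a simplified, hand-written stand-in for the real file.

```python
# Simplified lines in the assumed layout (not the real file).
SAMPLE = """\
0640          ; Arab Syrc # simplified entry
3001..3003    ; Bopo Hang Hani Hira Kana Yiii # simplified entry
"""

def parse_extensions(text):
    """Return a dict mapping codepoint -> frozenset of script codes."""
    extensions = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # drop trailing comment
        if not line:
            continue
        codes, scripts = (field.strip() for field in line.split(';'))
        first, _, last = codes.partition('..')
        value = frozenset(scripts.split())     # shared by the whole range
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            extensions[cp] = value
    return extensions

EXTENSIONS = parse_extensions(SAMPLE)
```

Codepoints without an entry can then be handled with EXTENSIONS.get(ord(char)), falling back to the single script from Scripts.txt.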


--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list
