Greg Price <[email protected]> added the comment:
I've gone and implemented a version of this that's integrated into
Tools/unicode/makeunicodedata.py , and into the unicodedata module. Patch
attached. Demo:
>>> import unicodedata, pprint
>>> pprint.pprint(unicodedata.property_value_aliases)
{'bidirectional': {'AL': ['Arabic_Letter'],
                   # ...
                   'WS': ['White_Space']},
 'category': {'C': ['Other'],
              # ...
 'east_asian_width': {'A': ['Ambiguous'],
                      # ...
                      'W': ['Wide']}}
Note that the values are lists. That's because a value can have multiple
aliases in addition to its "short name":
>>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
['Decimal_Number', 'digit']
This implementation also provides the reverse mapping, from an alias to the
"short name":
>>> pprint.pprint(unicodedata.property_value_by_alias)
{'bidirectional': {'Arabic_Letter': 'AL',
                   # ...
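For illustration, the reverse mapping can be derived from the forward one in a few lines of Python. This is only a sketch of the relationship between the two dicts, not the C code in the patch; the sample table is a small hand-written excerpt of the values shown above, and invert_aliases is a hypothetical helper name.

```python
# Hand-written excerpt in the shape of unicodedata.property_value_aliases.
# The real tables are generated by makeunicodedata.py; this is sample data.
SAMPLE_ALIASES = {
    'bidirectional': {'AL': ['Arabic_Letter'], 'WS': ['White_Space']},
    'category': {'C': ['Other'], 'Nd': ['Decimal_Number', 'digit']},
    'east_asian_width': {'A': ['Ambiguous'], 'W': ['Wide']},
}


def invert_aliases(aliases):
    """Map each alias back to its short name, separately for each property."""
    return {
        prop: {alias: short
               for short, names in values.items()
               for alias in names}
        for prop, values in aliases.items()
    }


SAMPLE_BY_ALIAS = invert_aliases(SAMPLE_ALIASES)
```

Because a short name can have several aliases, each of them ends up as its own key pointing at the same short name.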
This draft doesn't have tests or docs, but it's otherwise complete. I've posted
it at this stage for feedback on a few open questions:
* This version is in C; at import time some C code builds up the dicts, from
static tables in the header generated by makeunicodedata.py . It's not *that*
much code... but it sure would be more convenient to do in Python instead.
Should the unicodedata module perhaps have a Python part? I'd be happy to go
about that -- rename the existing C module to _unicodedata and add a small
unicodedata.py wrapper -- if there's a feeling that it'd be a good idea. Then
this could go there instead of using the C code I've just written.
* Is this API the right one?
  * This version has e.g.
    unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd' .
  * Perhaps make category/bidirectional/east_asian_width into attributes rather
    than keys? So e.g.
    unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd' .
  * Or: the standard says "loose matching" should be applied to these names, so
    e.g. 'decimal number' or 'is-decimal-number' is equivalent to 'Decimal_Number'.
    To accomplish that, perhaps make it not dicts at all but functions?
    So e.g. unicodedata.property_value_by_alias('decimal number') ==
    unicodedata.property_value_by_alias('Decimal_Number') == 'Nd' .
  * There's also room for bikeshedding on the names.
* How shall we handle ucd_3_2_0 for this feature?
This implementation doesn't attempt to record the older version of the data.
My reasoning is that because the applications of the old data are quite
specific and they haven't needed this information yet, it seems unlikely anyone
will ever really want to know from this module just which aliases existed
already in 3.2.0 and which didn't yet.
OTOH, as a convenience I've caused e.g.
unicodedata.ucd_3_2_0.property_value_by_alias to exist, just pointing to the
same object as unicodedata.property_value_by_alias . This allows
unicodedata.ucd_3_2_0 to remain a near drop-in substitute for the unicodedata
module itself, while minimizing the complexity it adds to the implementation.
Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's
still easy to get at them -- just get them from the module itself -- and it
makes it explicit that you're getting current rather than old data.
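To make the loose-matching option above more concrete, here is a rough Python sketch of a function-style lookup. It assumes the matching rule is: ignore case, whitespace, underscores and hyphens, and drop a leading "is" (my reading of the UAX #44 rule for symbolic values). The names _loose_key and lookup_category and the tiny sample table are hypothetical and not part of the patch.

```python
import re

# Hypothetical excerpt of the alias -> short name table for one property.
CATEGORY_BY_ALIAS = {'Other': 'C', 'Decimal_Number': 'Nd', 'digit': 'Nd'}


def _loose_key(name):
    """Normalize a name: drop whitespace, '_' and '-', lowercase, strip a leading 'is'."""
    key = re.sub(r'[\s_-]+', '', name).lower()
    return key[2:] if key.startswith('is') else key


# Index both the aliases and the short names under their normalized form,
# so that either spelling resolves to the short name.
_CATEGORY_INDEX = {_loose_key(a): s for a, s in CATEGORY_BY_ALIAS.items()}
_CATEGORY_INDEX.update({_loose_key(s): s for s in CATEGORY_BY_ALIAS.values()})


def lookup_category(name):
    """Return the short name for a loosely matched name; KeyError if unknown."""
    return _CATEGORY_INDEX[_loose_key(name)]
```

With this shape, lookup_category('decimal number'), lookup_category('is-decimal-number') and lookup_category('Decimal_Number') all resolve to the same short name. A function is needed here because a plain dict cannot normalize its keys on lookup.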
----------
keywords: +patch
nosy: +Greg Price
versions: +Python 3.9 -Python 3.8
Added file: https://bugs.python.org/file48616/prop-val-aliases.patch
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue16684>
_______________________________________