Greg Price <gnpr...@gmail.com> added the comment:

I've gone and implemented a version of this that's integrated into 
Tools/unicode/makeunicodedata.py , and into the unicodedata module.  Patch 
attached.  Demo:

>>> import unicodedata, pprint
>>> pprint.pprint(unicodedata.property_value_aliases)
{'bidirectional': {'AL': ['Arabic_Letter'],
# ...
                   'WS': ['White_Space']},
 'category': {'C': ['Other'],
# ...
 'east_asian_width': {'A': ['Ambiguous'],
# ...
                      'W': ['Wide']}}


Note that the values are lists.  That's because a value can have multiple 
aliases in addition to its "short name":

>>> unicodedata.property_value_aliases['category'][unicodedata.category('4')]
['Decimal_Number', 'digit']


This implementation also provides the reverse mapping, from an alias to the 
"short name":

>>> pprint.pprint(unicodedata.property_value_by_alias)
{'bidirectional': {'Arabic_Letter': 'AL',
# ...
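
For illustration, here is a minimal sketch of how such a reverse mapping could be derived from the forward one in pure Python. It is not the code in the patch (which builds the dicts in C). The sample data and the helper name invert_aliases are hypothetical. Whether the short name should also map to itself is left open here.

```python
# Hypothetical sample of the forward mapping (short name -> list of aliases);
# the real data would come from the generated tables.
forward = {
    "category": {"Nd": ["Decimal_Number", "digit"], "C": ["Other"]},
    "east_asian_width": {"W": ["Wide"], "A": ["Ambiguous"]},
}


def invert_aliases(forward):
    """Build {property: {alias: short_name}} from {property: {short_name: [aliases]}}."""
    reverse = {}
    for prop, values in forward.items():
        by_alias = reverse.setdefault(prop, {})
        for short, aliases in values.items():
            for alias in aliases:
                by_alias[alias] = short
    return reverse


reverse = invert_aliases(forward)
print(reverse["category"]["digit"])
```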


This draft doesn't have tests or docs, but it's otherwise complete. I've posted 
it at this stage for feedback on a few open questions:

* This version is in C; at import time some C code builds up the dicts, from 
static tables in the header generated by makeunicodedata.py .  It's not *that* 
much code... but it sure would be more convenient to do in Python instead.

  Should the unicodedata module perhaps have a Python part?  I'd be happy to do 
that -- rename the existing C module to _unicodedata and add a small 
unicodedata.py wrapper -- if there's a feeling that it'd be a good idea.  Then 
this code could live there instead of in the C code I've just written.
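
A rough sketch of what the Python side might look like, assuming the generated data were exposed as flat rows of (property, short name, aliases...). The row layout and the names _PROPERTY_VALUE_ALIAS_ROWS and build_property_value_aliases are assumptions for illustration only, not what makeunicodedata.py emits in the patch.

```python
# Hypothetical stand-in for a generated table: (property, short_name, *aliases).
_PROPERTY_VALUE_ALIAS_ROWS = (
    ("bidirectional", "AL", "Arabic_Letter"),
    ("bidirectional", "WS", "White_Space"),
    ("category", "Nd", "Decimal_Number", "digit"),
    ("east_asian_width", "W", "Wide"),
)


def build_property_value_aliases(rows):
    """Group flat rows into {property: {short_name: [aliases]}}."""
    result = {}
    for prop, short, *aliases in rows:
        result.setdefault(prop, {})[short] = list(aliases)
    return result


property_value_aliases = build_property_value_aliases(_PROPERTY_VALUE_ALIAS_ROWS)
print(property_value_aliases["category"]["Nd"])
```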


* Is this API the right one?
  * This version has e.g.
    unicodedata.property_value_by_alias['category']['Decimal_Number'] == 'Nd' .

  * Perhaps make category/bidirectional/east_asian_width into attributes rather
    than keys? So e.g.
    unicodedata.property_value_by_alias.category['Decimal_Number'] == 'Nd' .

  * Or: the standard says "loose matching" should be applied to these names, so
    e.g. 'decimal number' or 'is-decimal-number' is equivalent to
    'Decimal_Number'. To accomplish that, perhaps make it not dicts at all but
    functions?

    So e.g. unicodedata.property_value_by_alias('decimal number') ==
    unicodedata.property_value_by_alias('Decimal_Number') == 'Nd' .

  * There's also room for bikeshedding on the names.
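
To make the loose-matching option above concrete, here is a minimal sketch of a function-based lookup. It assumes the rule is the one described in UAX #44 (ignore case, whitespace, underscores, hyphens, and an initial "is" prefix). The names loose_key and make_loose_lookup and the sample data are hypothetical, and the lookup is built per property, which is one possible shape for the API rather than the one proposed.

```python
import re


def loose_key(name):
    """Normalize a name for loose matching (assumed rule, see lead-in)."""
    # Drop whitespace, underscores and hyphens, then fold case.
    key = re.sub(r"[\s_\-]+", "", name).lower()
    # Drop an initial "is" prefix; applied to both table keys and queries.
    if key.startswith("is"):
        key = key[2:]
    return key


def make_loose_lookup(by_alias):
    """Return a function mapping a loosely-spelled alias to its short name."""
    table = {loose_key(alias): short for alias, short in by_alias.items()}

    def lookup(name):
        return table[loose_key(name)]  # raises KeyError for unknown names

    return lookup


# Hypothetical sample of the alias -> short name data for one property.
category_by_alias = {"Decimal_Number": "Nd", "digit": "Nd", "Other": "C"}
lookup_category = make_loose_lookup(category_by_alias)
print(lookup_category("is-decimal-number"))
```

Normalizing the table keys once up front keeps each call to a single dict lookup; unknown names simply raise KeyError, as they would with the dict-based API.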


* How shall we handle ucd_3_2_0 for this feature?

  This implementation doesn't attempt to record the older version of the data.  
My reasoning is that because the applications of the old data are quite 
specific and they haven't needed this information yet, it seems unlikely anyone 
will ever really want to know from this module just which aliases existed 
already in 3.2.0 and which didn't yet.

  OTOH, as a convenience I've caused e.g. 
unicodedata.ucd_3_2_0.property_value_by_alias to exist, just pointing to the 
same object as unicodedata.property_value_by_alias . This allows 
unicodedata.ucd_3_2_0 to remain a near drop-in substitute for the unicodedata 
module itself, while minimizing the complexity it adds to the implementation.

  Might be cleanest to just leave these off of ucd_3_2_0 entirely, though. It's 
still easy to get at them -- just get them from the module itself -- and it 
makes it explicit that you're getting current rather than old data.

----------
keywords: +patch
nosy: +Greg Price
versions: +Python 3.9 -Python 3.8
Added file: https://bugs.python.org/file48616/prop-val-aliases.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue16684>
_______________________________________