[issue18234] Unicodedata module should provide access to codepoint aliases

Marc-Andre Lemburg Mon, 24 Jun 2013 08:09:59 -0700

Marc-Andre Lemburg added the comment:

On 24.06.2013 16:58, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> Here is an example of "prior art" that is relevant to this discussion:
> 
> """
> charnames::viacode(code)
> ..
> As mentioned above under ALIASES, Unicode 6.1 defines extra names (synonyms 
> or aliases) for some code points, most of which were already available as 
> Perl extensions. All these are accepted by \N{...} and the other functions in 
> this module, but viacode has to choose which one name to return for a given 
> input code point, so it returns the "best" name. To understand how this 
> works, it is helpful to know more about the Unicode name properties. All code 
> points actually have only a single name, which (starting in Unicode 2.0) can 
> never change once a character has been assigned to the code point. But 
> mistakes have been made in assigning names, for example sometimes a clerical 
> error was made during the publishing of the Standard which caused words to be 
> misspelled, and there was no way to correct those. The Name_Alias property 
> was eventually created to handle these situations. If a name was wrong, a 
> corrected synonym would be published for it, using Name_Alias. viacode will 
> return that corrected synonym as the "best" name for a code point. (It is
> even possible, though it hasn't happened yet, that the correction itself
> will need to be corrected, and so another Name_Alias can be created for
> that code point; viacode will return the most recent correction.)
> 
> The Unicode name for each of the control characters (such as LINE FEED) is 
> the empty string. However almost all had names assigned by other standards, 
> such as the ASCII Standard, or were in common use. viacode returns these 
> names as the "best" ones available. Unicode 6.1 has created Name_Aliases for 
> each of them, including alternate names, like NEW LINE. viacode uses the 
> original name, "LINE FEED" in preference to the alternate. Similarly the name 
> returned for U+FEFF is "ZERO WIDTH NO-BREAK SPACE", not "BYTE ORDER MARK".
> """ <http://perldoc.perl.org/charnames.html#charnames%3a%3aviacode(code)>
> 
> If .name() cannot be touched, what about implementing .bestname() with the 
> above semantics?


I think it's better to let the programmer decide what the "best"
name should be, e.g. some people will like ESC better than ESCAPE or
\u001b or \x1b.

unicodedata only provides neutral access to what's in the Unicode database.
It doesn't make any decisions on what's good or bad ;-)
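
As a rough illustration of leaving that choice to the programmer, here is a
minimal sketch of application-level code. The PREFERRED table and the
display_name() helper are hypothetical names made up for this example and are
not part of unicodedata; only unicodedata.name() with its default argument is
the real API.

```python
import unicodedata

# Hypothetical, application-chosen names. The programmer decides what
# counts as "best" here; unicodedata itself stays neutral.
PREFERRED = {
    0x1B: "ESC",
    0x0A: "LINE FEED",
}

def display_name(ch, preferred=PREFERRED):
    """Return the caller's preferred name for ch, else the database name.

    Falls back to a U+XXXX label when the Unicode database has no name
    for the code point (as is the case for control characters).
    """
    cp = ord(ch)
    if cp in preferred:
        return preferred[cp]
    # unicodedata.name() returns the default instead of raising
    # ValueError when the character has no name.
    return unicodedata.name(ch, "U+%04X" % cp)

if __name__ == "__main__":
    for ch in ("\x1b", "\n", "A", "\x00"):
        print(repr(ch), "->", display_name(ch))
```

Someone who prefers ESCAPE, or an escape sequence such as \x1b, would only
need to change the table, not the module.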

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18234>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
