[issue18234] Unicodedata module should provide access to codepoint aliases

2014-10-11 Thread flying sheep

flying sheep added the comment:

I don't know if it came with Unicode 7.0, but NameAliases.txt now contains a clarification:

# Note that currently the only instances of multiple aliases of the same
# type for a single code point are either of type "control" or "abbreviation".
# An alias of type "abbreviation" can, in principle, be added for any code
# point, although currently aliases of type "correction" do not have
# any additional aliases of type "abbreviation". Such relationships
# are not enforced by stability policies.

it says “currently”, so it isn’t guaranteed to stay that way: other types 
could also appear multiple times for one code point in the future.

so as much as i’d like to follow Alexander’s proposal, i think we shouldn’t 
extend name() that way. a function that either returns a name string, a default 
value, or a list of aliases, or else raises an exception, is too complex.

i think we should create:

unicodedata.aliases(chr, 
type=(None|'correction'|'control'|'alternate'|'figment'|'abbreviation'))

and make

aliases(chr) return a dict with all aliases for the character, and make
aliases(chr, type) return a list of aliases for that type (possibly empty)

examples:

aliases('\b') == {'control': ['BACKSPACE'], 'abbreviation': ['BS']}
aliases('\b', 'control') == ['BACKSPACE']
aliases('b') == {}
aliases('b', 'control') == []
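
The semantics proposed above can be sketched in a few lines. This is only an illustration under stated assumptions: unicodedata has no aliases() function, and the table below is a tiny hand-copied excerpt of NameAliases.txt rather than the real database.

```python
# Hypothetical sketch of the proposed API; not part of unicodedata.
# _ALIASES is a hand-copied excerpt of NameAliases.txt, in file order.
_ALIASES = {
    0x0008: [("control", "BACKSPACE"), ("abbreviation", "BS")],
    0x001B: [("control", "ESCAPE"), ("abbreviation", "ESC")],
}


def aliases(char, type=None):
    """Return all aliases as a dict, or the aliases of one type as a list."""
    entries = _ALIASES.get(ord(char), [])
    if type is None:
        # Group by alias type; each value is a list because a type
        # may occur more than once for the same code point.
        grouped = {}
        for alias_type, alias in entries:
            grouped.setdefault(alias_type, []).append(alias)
        return grouped
    # A type with no aliases yields an empty list rather than an error.
    return [alias for alias_type, alias in entries if alias_type == type]
```

The parameter is called type only to match the proposal; it shadows the builtin inside the function.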

---

alternative: when specifying a type, it’ll raise an error if no alias of this 
type exists. but because of the sparse nature of aliases i’m against that.

--
nosy: +flying sheep

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18234] Unicodedata module should provide access to codepoint aliases

2014-02-10 Thread Ezio Melotti

Ezio Melotti added the comment:

See also #20433.

--
stage:  -> needs patch
versions: +Python 3.5 -Python 3.4




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 24.06.2013 18:10, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
>> The .aliases() function would have to return a list, not a single
>> name, so a parameter would cause the return type to change, which
>> is not a good idea.
> 
> You misunderstood my proposal.  .name() will still return a single name, but 
> the type parameter will control which name to return:
> 
> name(ch[, 
> type=(None|'correction'|'control'|'alternate'|'figment'|'abbreviation')])
> 
> None - default, same as current behavior.
> 
> correction - indicates that the returned name is a corrected form for the 
> original name (which remains valid) for the same code point.
> 
> control - return a new name added for a control character.
> 
> alternate - return an alternate name for a character
> 
> figment - return a name for a character that has been documented but was 
> never in any actual standard.
> 
> abbreviation - return a common abbreviation for a character

How can you be sure that each of those alias types occurs only
once?

NameAliases.txt doesn't say anything about this, AFAIK:

http://www.unicode.org/Public/UNIDATA/NameAliases.txt

Also, what would name() return in case no alias of a particular
type is defined?

I think it would be easier and more future-proof to have a function
aliases(code) -> [(type, alias),...] which simply returns all
defined aliases. Applications could then add helpers to
select the type they would like to use.

It may make sense to also add the name(code) value as
e.g. ('standard', name(code)) to that list.
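
A minimal sketch of that shape, assuming the data is read from text in the NameAliases.txt format (code;alias;type). The helper names parse_name_aliases, aliases and include_standard are made up for illustration, and SAMPLE is a short excerpt, not the whole file.

```python
import unicodedata

# Short excerpt in the NameAliases.txt format: code;alias;type
SAMPLE = """\
# excerpt only
000A;LINE FEED;control
000A;NEW LINE;control
000A;LF;abbreviation
FEFF;BYTE ORDER MARK;alternate
FEFF;BOM;abbreviation
"""


def parse_name_aliases(text):
    """Map each code point to its [(type, alias), ...] list, in file order."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, alias_type = line.split(";")
        table.setdefault(int(code, 16), []).append((alias_type, alias))
    return table


def aliases(char, table, include_standard=False):
    """Return every alias of char; optionally prepend the standard name."""
    result = list(table.get(ord(char), []))
    if include_standard:
        name = unicodedata.name(char, None)
        if name is not None:
            result.insert(0, ("standard", name))
    return result
```

Returning plain tuples leaves the choice of a preferred type to the caller, and nothing breaks if a type later occurs more than once.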

-- 
Marc-Andre Lemburg
eGenix.com





[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Martin v. Löwis

Martin v. Löwis added the comment:

But some of these types could still have lists as values, no?




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

> The .aliases() function would have to return a list, not a single
> name, so a parameter would cause the return type to change, which
> is not a good idea.

You misunderstood my proposal.  .name() will still return a single name, but 
the type parameter will control which name to return:

name(ch[, 
type=(None|'correction'|'control'|'alternate'|'figment'|'abbreviation')])

None - default, same as current behavior.

correction - indicates that the returned name is a corrected form for the 
original name (which remains valid) for the same code point.

control - return a new name added for a control character.

alternate - return an alternate name for a character

figment - return a name for a character that has been documented but was never 
in any actual standard.

abbreviation - return a common abbreviation for a character
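
A rough sketch of how such a name() wrapper might behave. It is not the real unicodedata.name() signature: the type keyword, the _ALIASES table and the _MISSING sentinel are all hypothetical. The table stores one alias per type, which is exactly the assumption this proposal relies on.

```python
import unicodedata

# Hypothetical data: one alias per type, hand-copied for U+001B only.
_ALIASES = {
    0x001B: {"control": "ESCAPE", "abbreviation": "ESC"},
}
_MISSING = object()  # sentinel so that None can be passed as a default


def name(char, default=_MISSING, type=None):
    """Return the standard name, or the alias of the requested type."""
    if type is None:
        found = unicodedata.name(char, None)  # current behavior
    else:
        found = _ALIASES.get(ord(char), {}).get(type)
    if found is not None:
        return found
    if default is _MISSING:
        raise ValueError("no such name")
    return default
```

With this table, name('\x1b', type='abbreviation') gives 'ESC', while name('\x1b') still raises ValueError as it does today.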




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 24.06.2013 16:58, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> Here is an example of "prior art" that is relevant to this discussion:
> 
> """
> charnames::viacode(code)
> ..
> As mentioned above under ALIASES, Unicode 6.1 defines extra names (synonyms 
> or aliases) for some code points, most of which were already available as 
> Perl extensions. All these are accepted by \N{...} and the other functions in 
> this module, but viacode has to choose which one name to return for a given 
> input code point, so it returns the "best" name. To understand how this 
> works, it is helpful to know more about the Unicode name properties. All code 
> points actually have only a single name, which (starting in Unicode 2.0) can 
> never change once a character has been assigned to the code point. But 
> mistakes have been made in assigning names, for example sometimes a clerical 
> error was made during the publishing of the Standard which caused words to be 
> misspelled, and there was no way to correct those. The Name_Alias property 
> was eventually created to handle these situations. If a name was wrong, a 
> corrected synonym would be published for it, using Name_Alias. viacode will 
> return that corrected synonym as the "best" name for a code point. (It is 
> even possible, 
> though it hasn't happened yet, that the correction itself will need to be 
> corrected, and so another Name_Alias can be created for that code point; 
> viacode will return the most recent correction.)
> 
> The Unicode name for each of the control characters (such as LINE FEED) is 
> the empty string. However almost all had names assigned by other standards, 
> such as the ASCII Standard, or were in common use. viacode returns these 
> names as the "best" ones available. Unicode 6.1 has created Name_Aliases for 
> each of them, including alternate names, like NEW LINE. viacode uses the 
> original name, "LINE FEED" in preference to the alternate. Similarly the name 
> returned for U+FEFF is "ZERO WIDTH NO-BREAK SPACE", not "BYTE ORDER MARK".
> """ 
> 
> If .name() cannot be touched, what about implementing .bestname() with the 
> above semantics?

I think it's better to let the programmer decide what the "best"
name should be, e.g. some people will like ESC better than ESCAPE or
\u001b or \x1b.

unicodedata only provides neutral access to what's in the Unicode database.
It doesn't make any decisions on what's good or bad ;-)




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

Here is an example of "prior art" that is relevant to this discussion:

"""
charnames::viacode(code)
..
As mentioned above under ALIASES, Unicode 6.1 defines extra names (synonyms or 
aliases) for some code points, most of which were already available as Perl 
extensions. All these are accepted by \N{...} and the other functions in this 
module, but viacode has to choose which one name to return for a given input 
code point, so it returns the "best" name. To understand how this works, it is 
helpful to know more about the Unicode name properties. All code points 
actually have only a single name, which (starting in Unicode 2.0) can never 
change once a character has been assigned to the code point. But mistakes have 
been made in assigning names, for example sometimes a clerical error was made 
during the publishing of the Standard which caused words to be misspelled, and 
there was no way to correct those. The Name_Alias property was eventually 
created to handle these situations. If a name was wrong, a corrected synonym 
would be published for it, using Name_Alias. viacode will return that 
corrected synonym as the "best" name for a code point. (It is even possible, 
though it hasn't happened yet, that the correction itself will need to be 
corrected, and so another Name_Alias can be created for that code point; 
viacode will return the most recent correction.)

The Unicode name for each of the control characters (such as LINE FEED) is the 
empty string. However almost all had names assigned by other standards, such as 
the ASCII Standard, or were in common use. viacode returns these names as the 
"best" ones available. Unicode 6.1 has created Name_Aliases for each of them, 
including alternate names, like NEW LINE. viacode uses the original name, "LINE 
FEED" in preference to the alternate. Similarly the name returned for U+FEFF is 
"ZERO WIDTH NO-BREAK SPACE", not "BYTE ORDER MARK".
""" 

If .name() cannot be touched, what about implementing .bestname() with the 
above semantics?
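
For illustration, the viacode rules quoted above could be approximated like this. bestname() does not exist in unicodedata, the table is a hand-copied excerpt of NameAliases.txt, and treating the last listed correction as the most recent one is an assumption.

```python
import unicodedata

# Hand-copied excerpt of NameAliases.txt, in file order.
_ALIASES = {
    0x000A: [("control", "LINE FEED"), ("control", "NEW LINE"),
             ("abbreviation", "LF")],
    0xFE18: [("correction",
              "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET")],
    0xFEFF: [("alternate", "BYTE ORDER MARK"), ("abbreviation", "BOM")],
}


def bestname(char, default=None):
    """Pick one name per code point, roughly following the quoted rules."""
    entries = _ALIASES.get(ord(char), [])
    corrections = [alias for kind, alias in entries if kind == "correction"]
    if corrections:
        return corrections[-1]       # assumed: last entry is the newest
    standard = unicodedata.name(char, None)
    if standard is not None:
        return standard              # e.g. U+FEFF keeps its standard name
    controls = [alias for kind, alias in entries if kind == "control"]
    if controls:
        return controls[0]           # original control name, not alternates
    return default
```

Under these assumptions bestname('\n') is 'LINE FEED' and bestname('\ufeff') is 'ZERO WIDTH NO-BREAK SPACE', matching the two examples in the quoted text.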




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 24.06.2013 16:35, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> MAL> Please leave the function as it is, i.e. a 1-1 mapping to the
> MAL> official, non-changing Unicode name reference (including
> MAL> spelling errors, etc). Same with code points that have no name.
> 
> Since we have code points with no name - it is not 1-1 mapping but 1 to 0 or 
> 1.

True, it's not 1-1 in the mathematical sense (bijective), only surjective.
However, it is 1-1 for all code points which have a name assigned.

> Unicode Standard recommends using "Code Point Labels" "To provide unique, 
> meaningful labels for code points that do not have character names." (Section 
> 4.9.)
> 
> These labels are not very useful:
> 
> Control: control-
> Reserved: reserved-
> Noncharacter: noncharacter-
> Private-Use: private-use-
> Surrogate: surrogate-

I don't see any advantage in using these over plain \u codes.

> According to the description in NameAliases.txt:
> 
> # The formal name aliases are part of the Unicode character namespace, which
> # includes the character names and the names of named character sequences.
> 
> I believe this means that formal name aliases are as official as the 
> character names.

Yes, but they are official aliases, not official code point names :-)

> If we don't change the default, what is the downside in adding an optional 
> type argument to unicodedata.name()?  After all, according to the standard, 
> aliases *are* names, just a different *type* of names.

The .aliases() function would have to return a list, not a single
name, so a parameter would cause the return type to change, which
is not a good idea.

A new function also makes the origin of these names clear to the
user.




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

MAL> Please leave the function as it is, i.e. a 1-1 mapping to the
MAL> official, non-changing Unicode name reference (including
MAL> spelling errors, etc). Same with code points that have no name.

Since we have code points with no name - it is not 1-1 mapping but 1 to 0 or 1.

Unicode Standard recommends using "Code Point Labels" "To provide unique, 
meaningful labels for code points that do not have character names." (Section 
4.9.)

These labels are not very useful:

Control: control-
Reserved: reserved-
Noncharacter: noncharacter-
Private-Use: private-use-
Surrogate: surrogate-
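
To make the label scheme concrete, here is a small sketch that derives such a label from the general category. The function name code_point_label is invented here, and the hexadecimal suffix reflects my reading of the standard, which I believe also wraps the label in angle brackets.

```python
import unicodedata


def code_point_label(char):
    """Return a label such as 'control-0008', or None if char has a name."""
    cp = ord(char)
    category = unicodedata.category(char)
    if category == "Cc":
        kind = "control"
    elif category == "Co":
        kind = "private-use"
    elif category == "Cs":
        kind = "surrogate"
    elif category == "Cn":
        # Noncharacters: U+FDD0..U+FDEF plus the last two code points
        # of every plane (U+xxFFFE and U+xxFFFF).
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            kind = "noncharacter"
        else:
            kind = "reserved"
    else:
        return None  # assigned character with a regular name
    return "%s-%04X" % (kind, cp)
```

The label carries no more information than the code point and its category, which is why it adds little over a plain escape.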

According to the description in NameAliases.txt:

# The formal name aliases are part of the Unicode character namespace, which
# includes the character names and the names of named character sequences.

I believe this means that formal name aliases are as official as the character 
names.

If we don't change the default, what is the downside in adding an optional type 
argument to unicodedata.name()?  After all, according to the standard, aliases 
*are* names, just a different *type* of names.




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 24.06.2013 10:05, Serhiy Storchaka wrote:
> 
> Serhiy Storchaka added the comment:
> 
> Perhaps unicodedata.aliases() should return not a list, but an ordered dict.
> 
> Which name should the "namereplace" error handler use: the original or the 
> corrected one? Should it use the first alias if there is no original name?

For compatibility with other tools, it should use .name(), not .aliases()
to determine the name. Please note that the aliases are not the official
Unicode names of the code points.
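
As a sketch of what that rule implies, assuming Python 3.5 or later, where a "namereplace" error handler is available: the escape is built from the .name() value, so an uncorrected spelling shows up in the output. The helper expected_escape is made up for this comparison.

```python
import unicodedata


def expected_escape(char):
    """Build the \\N{...} escape from the standard name, as .name() gives it."""
    return ("\\N{%s}" % unicodedata.name(char)).encode("ascii")


# U+FE18 has a misspelled standard name and a "correction" alias.
encoded = "\ufe18".encode("ascii", "namereplace")
print(encoded == expected_escape("\ufe18"))
```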




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Perhaps unicodedata.aliases() should return not a list, but an ordered dict.

Which name should the "namereplace" error handler use: the original or the 
corrected one? Should it use the first alias if there is no original name?




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-24 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 23.06.2013 22:43, Alexander Belopolsky wrote:
> 
> Alexander Belopolsky added the comment:
> 
> unicodedata.name() was discussed in #12353 (msg144739) where MvL argued that 
> misspelled names are better than corrected because they are more likely to 
> appear misspelled in other sources.  I am not sure I buy this argument.  
> Someone googling for 'BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS' 
> will probably just enter BYZANTINE VASIS and find what he or she needs.  A 
> more likely scenario is someone trying to get all FTHORA symbols using a 
> naive code like this: [hex(i) for i in range(1114112) if 'FTHORA' in 
> ud.name(chr(i), '')].
> 
> An even more likely scenario is someone seeing a fancy symbol on the web and 
> wanting to use it in a Python program.  Such a programmer would copy the symbol 
> to the Python prompt, call unicodedata.name() and copy the result into the 
> program.  Do we want to encourage people to perpetuate the mistake that Unicode 
> has corrected?
> 
> I don't think the issue of control codes names was discussed in #12353.  I 
> see no downside with returning the first alias in case no name is present.

We should stick to the rules. Please leave the function as it
is, i.e. a 1-1 mapping to the official, non-changing Unicode
name reference (including spelling errors, etc). Same with
code points that have no name.

If you want to expose the aliases, you can do so in a new
function, say .aliases() which then returns the list of
aliases of a character (including the original name,
if available).

If we change the return values of .name() to whatever we think
would be more usable, we'd be modifying how Python programmers
see the Unicode database. That's not the purpose of the module.




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-23 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

I mistyped the issue reference above; it should be #12753, not #12353.




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-23 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

unicodedata.name() was discussed in #12353 (msg144739) where MvL argued that 
misspelled names are better than corrected because they are more likely to 
appear misspelled in other sources.  I am not sure I buy this argument.  
Someone googling for 'BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS' 
will probably just enter BYZANTINE VASIS and find what he or she needs.  A more 
likely scenario is someone trying to get all FTHORA symbols using a naive code 
like this: [hex(i) for i in range(1114112) if 'FTHORA' in ud.name(chr(i), '')].
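
The effect can be checked with a runnable version of that search. The helper name names_containing is made up; the lookup() call relies on unicodedata.lookup() accepting name aliases, which it does since Python 3.3.

```python
import unicodedata as ud


def names_containing(word):
    """Hex code points whose standard name contains word (full scan)."""
    return [hex(i) for i in range(0x110000) if word in ud.name(chr(i), "")]


# lookup() resolves the corrected alias to the code point ...
char = ud.lookup("BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS")

# ... but name() still reports the misspelled standard name, so a
# search for the correct spelling misses this code point.
found_by_correct_spelling = hex(ord(char)) in names_containing("FTHORA")
found_by_misspelling = hex(ord(char)) in names_containing("FHTORA")
```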

An even more likely scenario is someone seeing a fancy symbol on the web and 
wanting to use it in a Python program.  Such a programmer would copy the symbol 
to the Python prompt, call unicodedata.name() and copy the result into the 
program.  Do we want to encourage people to perpetuate the mistake that Unicode 
has corrected?

I don't think the issue of control codes names was discussed in #12353.  I see 
no downside with returning the first alias in case no name is present.




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-23 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

> Can a character or sequence have multiple aliases?

Yes, for example, most control characters have two aliases (and no name).

0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation

(See )

> What will be a result type of unicodedata.name() with "abbreviation" keyword 
> value?

Under my proposal:

>>> unicodedata.name('\N{ESCAPE}', type='abbreviation')
'ESC'

I would also like to consider changing the default slightly.  I find the 
following behavior rather unhelpful:

>>> unicodedata.name('\N{ESC}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

I think most users would expect 'ESCAPE' instead.

The following is more of a curiosity rather than a genuine problem, but is a 
good illustration for a general point:

>>> unicodedata.name('\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET}')
'PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET'

(Note that the word "BRACKET" is misspelled as "BRAKCET" in the output.)

Since the "correction" alias is the official method of publishing corrections to 
Unicode names, I think unicodedata.name() should return the corrected name by default.




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-23 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Can a character or sequence have multiple aliases? What would the result type 
of unicodedata.name() be with the "abbreviation" keyword value?




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-23 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

Rather than adding a new method to unicodedata, what do you think about adding 
a type keyword argument to unicodedata.name()?  It can default to "canonical" 
and have possible values "control", "abbreviation", etc.

See also #12753.




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-20 Thread Alexander Belopolsky

Alexander Belopolsky added the comment:

UCD provides more than just a list of aliases: formal name aliases have "type" 
- control, abbreviation, etc.  See 
.




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-20 Thread Martin v. Löwis

Martin v. Löwis added the comment:

I think the best way would be to provide a function unicodedata.aliases, 
returning a list of names for a given character or sequence.




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-17 Thread Antoine Pitrou

Changes by Antoine Pitrou:


--
nosy: +benjamin.peterson, ezio.melotti, lemburg, loewis, serhiy.storchaka




[issue18234] Unicodedata module should provide access to codepoint aliases

2013-06-16 Thread Alexander Belopolsky

New submission from Alexander Belopolsky:

Python is aware of unicode codepoint aliases, but unicodedata does not provide 
a way to find aliases of a given codepoint:

>>> import unicodedata as ucd
>>> ucd.lookup('ESCAPE') == '\N{ESCAPE}'
True
>>> ucd.lookup('RS') == '\N{RS}'
True

but

>>> ucd.name('\N{ESCAPE}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name


>>> ucd.name('\N{RS}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
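
The same asymmetry as a small script rather than a prompt session. The helper resolves_to is made up for this illustration; it only uses unicodedata.lookup() and unicodedata.name().

```python
import unicodedata as ucd


def resolves_to(name, char):
    """True if lookup() maps name (standard name or alias) to char."""
    try:
        return ucd.lookup(name) == char
    except KeyError:
        return False


# Aliases work in the name -> character direction ...
forward = [resolves_to(n, c) for n, c in
           [("ESCAPE", "\x1b"), ("ESC", "\x1b"), ("RS", "\x1e")]]

# ... but character -> name finds nothing for these control characters.
backward = [ucd.name(c, None) for c in ("\x1b", "\x1e")]
```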

--
messages: 191300
nosy: belopolsky
priority: normal
severity: normal
status: open
title: Unicodedata module should provide access to codepoint aliases
type: enhancement
versions: Python 3.4
