[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

Tom Christiansen Fri, 30 Sep 2011 15:07:34 -0700

Tom Christiansen <[email protected]> added the comment:

>Ezio Melotti <[email protected]> added the comment:


> Leaving named sequences for unicodedata.lookup() only (and not for
> \N{}) makes sense.

There are certainly advantages to that strategy: you don't have to
deal with [\N{sequence}] issues.  If the argument to unicodedata.lookup()
can be any of name, alias, or sequence, that seems ok.  \N{} should
still do aliases, though, since those don't have the complication that
sequences have.

You may wish unicodedata.name() to return the alias in preference, however.
That's what we do.  And of course, there is no issue of sequences there.

The rest of this perhaps painfully long message is just elaboration
and icing on what I've said above.

--tom

> The list of aliases is so small (11 entries) that I'm not sure using a
> binary search for it would bring any advantage.  Having a single
> lookup algorithm that looks in both tables doesn't work because the
> aliases lookup must be in _getcode for \N{...} to work, whereas the
> lookup of named sequences will happen in unicodedata_lookup
> (Modules/unicodedata.c:1187).  I think we can leave the for loop over
> aliases in _getcode and implement a separate (and binary) search in
> unicodedata_lookup for the named sequences.  Does that sound fine?

If you mean, is it ok to add just the aliases and not the named sequences to
\N{}, it is certainly better than not doing so at all.  Plus that way you do
*not* have to figure out what in the world to do with [^a-c\N{sequence}],
since that would have to be something like (?!\N{sequence})[^a-c], which is
hardly obvious, especially if \N{sequence} actually starts with [a-c].
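
To make that concrete, here is a minimal Python sketch of the hand
expansion.  It assumes the sequence is LATIN CAPITAL LETTER A WITH MACRON
AND GRAVE, written out as the literal U+0100 U+0300; SEQ and
not_abc_or_seq are names made up for this illustration.

```python
import re

# Assumption: stands in for \N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}.
SEQ = "\u0100\u0300"

# Hand expansion of [^a-c\N{sequence}]: refuse to match where the sequence
# starts, otherwise accept one character outside a-c.
not_abc_or_seq = re.compile("(?!" + re.escape(SEQ) + ")[^a-c]")

print(bool(not_abc_or_seq.match("x")))       # True: ordinary character
print(bool(not_abc_or_seq.match("a")))       # False: inside a-c
print(bool(not_abc_or_seq.match(SEQ)))       # False: the sequence itself
print(bool(not_abc_or_seq.match("\u0300")))  # True: a lone combining grave
```

The last line is part of why this is hardly obvious: a lone combining
mark is still accepted, so the expansion does not behave like a class
that simply lists both code points.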

However, because the one namespace comprises all three of names,
aliases, and named sequences, it might be best to have a functional
(meaning, non-regex) API that allows one to do a fetch on the whole
namespace, or on each individual component.
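
A rough sketch of what such a functional API could look like in Python
follows.  The function name charnames_lookup, the kinds parameter, and the
three tiny tables are all hypothetical; the tables hold only entries
mentioned in this message, where a real implementation would load the
Unicode data files.

```python
# Hand-copied excerpts only; a real implementation would load the UCD files.
NAMES = {
    "LATIN SMALL LETTER A": "a",
    "BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS": "\U0001D0C5",
}
ALIASES = {
    "BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS": "\U0001D0C5",
}
SEQUENCES = {
    "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE": "\u0100\u0300",
}

TABLES = {"name": NAMES, "alias": ALIASES, "sequence": SEQUENCES}


def charnames_lookup(name, kinds=("name", "alias", "sequence")):
    """Return the string for `name`, searching only the requested components."""
    for kind in kinds:
        table = TABLES[kind]
        if name in table:
            return table[name]
    raise KeyError(name)


# Whole namespace by default, or a single component on request.
print(len(charnames_lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")))
print(charnames_lookup("LATIN SMALL LETTER A", kinds=("name",)))
```

Returning a string rather than an integer is what lets one function cover
sequences as well as single code points.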

The ICU library supports this sort of thing.  In ICU4J's Java bindings, 
we find this:

    static int getCharFromExtendedName(String name)
       [icu] Find a Unicode character by either its name and return its code point value.
    static int getCharFromName(String name)
       [icu] Finds a Unicode code point by its most current Unicode name and return its code point value.
    static int getCharFromName1_0(String name)
       [icu] Find a Unicode character by its version 1.0 Unicode name and return its code point value.
    static int getCharFromNameAlias(String name)
       [icu] Find a Unicode character by its corrected name alias and return its code point value.

The first one obviously has a bug in its definition, as the English
doesn't scan.  Looking at the full definition is even worse.  Rather
than dig out the src jar, I looked at ICU4C, but its own bindings are
completely different.  There you have only one function, with an enum to
say what namespace to access:

    UChar32 u_charFromName  (       UCharNameChoice         nameChoice, 
                    const char *    name, 
                    UErrorCode *    pErrorCode 
            )

The UCharNameChoice enum tells what sort of thing you want:

    U_UNICODE_CHAR_NAME,
    U_UNICODE_10_CHAR_NAME,
    U_EXTENDED_CHAR_NAME,
    U_CHAR_NAME_ALIAS,          
    U_CHAR_NAME_CHOICE_COUNT

Looking at the src for the Java is no more immediately illuminating, 
but I think that "extended" may refer to a union of the old 1.0 names 
with the current names.

Now I'll tell you what Perl does.  I do this not to say it is "right",
but just to show you one possible strategy.  I also am in the middle
of writing about this for the Camel, so it is in my head.

Perl does not provide the old 1.0 names at all.  We don't have a Unicode
1.0 legacy to support, which makes this cleaner.  However, we do provide
for the names of the C0 and C1 Control Codes, because apart from Unicode
1.0, they don't condescend to name the ASCII or Latin1 control codes.  

We also provide for certain well known aliases from the Names file:
anything that says "* commonly abbreviated as ...", so things like LRO
and ZWJ and such.

Perl makes no distinction between anything in the namespace when using
the \N{} form for string and regex escapes.  That means when you use
"\N{...}" or /\N{...}/, you don't know which it is, nor can you.
(And yes, the bracketed character class issue is annoying and unsolved.)

However, the "functional" API does make a slight distinction.  

 -- charnames::vianame() takes a name or alias (as a string) and returns a
        single integer code point.

        eg: This therefore converts "LATIN SMALL LETTER A" into 0x61.
            It also converts both 
                BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
            and 
                BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
            into 0x1D0C5.  See below.

 -- charnames::string_vianame() takes a string name, alias, *or* sequence, 
        and gives back a string.   

        eg: This therefore converts "LATIN SMALL LETTER A" into "a".
            Since it has a string return instead of an int, it now also
            handles everything from NamedSequences file as well. (See below.)

 -- charnames::viacode() takes an integer and gives back the official alias
        if there is one, and the official name if there is not.

        eg: This converts 0x61 into "LATIN SMALL LETTER A".
            It also converts 0x1D0C5 into "BYZANTINE MUSICAL SYMBOL FTHORA
            SKLIRON CHROMA VASIS".
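
For comparison, here is a rough Python analogue of those three functions.
It is only a sketch, and it assumes an interpreter whose
unicodedata.lookup() accepts aliases and named sequences, which is the
behaviour this issue asks for.  The helper names simply mirror the Perl
ones and are not part of any Python module.

```python
import unicodedata


def vianame(name):
    """Name or alias -> one integer code point; named sequences are rejected."""
    s = unicodedata.lookup(name)
    if len(s) != 1:
        raise KeyError(name + " is a named sequence, not a single character")
    return ord(s)


def string_vianame(name):
    """Name, alias, or named sequence -> string."""
    return unicodedata.lookup(name)


def viacode(code):
    """Code point -> name.

    Unlike Perl's viacode(), unicodedata.name() reports the formal name
    and does not prefer a corrected alias.
    """
    return unicodedata.name(chr(code))


print("%04X" % vianame("BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS"))
print(len(string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")))
print(viacode(0x61))
```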

Consider

    BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS

That was an error, and there is an official alias fixing it:

    BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS

(That's FHTORA vs FTHORA.)

You may use either as the name, and if you reverse the code 
point to name, you get the replacement alias.

 % perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS")'
 1D0C5

 % perl -mcharnames -wle 'printf "%04X\n", charnames::vianame("BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS")'
 1D0C5

 % perl -mcharnames -wle 'print charnames::viacode(charnames::vianame("BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS"))'
 BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS

So on round-tripping, I gave it the "wrong" one (the original) and it gave
me back the "right" one (the replacement).

Using the \N{} thing, it again doesn't matter:

 % perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS}"'
 1D0C5

 % perl -mcharnames=:full -wle 'printf "%04X\n", ord "\N{BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS}"'
 1D0C5

The interesting thing is the named sequences.  string_vianame() works just
fine on those:

 % perl -mcharnames -wle 'print length charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'
 2

 % perl -mcharnames -wle 'printf "U+%v04X\n",  charnames::string_vianame("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")'
 U+0100.0300

And that works fine with \N{} as well (provided you don't try charclasses):

 % perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
 Ā̀

 % perl -mcharnames=:full -wle 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"' | uniquote -v
 \N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}

 % perl -mcharnames=:full -wle 'print length "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
 2

 % perl -mcharnames=:full -wle 'printf "U+%v04X\n", "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
 U+0100.0300

It's kinda sad that for \N{} and sequences you can't just "do the right
thing" with strings and say that charclass stuff just isn't supported.
But my guess is that this simply won't work because you don't have 
first class regexes.  If you pass both of these to the regex engine,
they should behave the same (and would, assuming the regex compiler
knows about \N{} escapes):

    "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
    r'\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}'

However, that falls apart if you do

    "[^\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]"
    r'[^\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]'

Because the compiler will do the substitution early on the first
one but not the second.  This seems a problem, eh?  So I guess
you can't do it at all?  Or could you document it?   I think there
is no good solution here.  Perl can and does actually do something
quite reasonable in the noncharclass case, but that is because we
know that we are compiling a regex in virtually all scenarios.
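
On the Python side, the early substitution is easy to see.  This small
sketch uses a single-character name so that it does not assume any
named-sequence support; the variable names cooked and raw are just for
illustration.

```python
# The ordinary literal is resolved by the compiler before any regex code runs.
cooked = "[^\N{LATIN CAPITAL LETTER A WITH MACRON}]"

# The raw literal reaches the regex engine with the escape still spelled out.
raw = r"[^\N{LATIN CAPITAL LETTER A WITH MACRON}]"

print(len(cooked))    # 4: '[', '^', U+0100, ']'
print(raw == cooked)  # False
print("\\N{" in raw)  # True: the regex engine would have to expand this itself
```

With a two-code-point sequence, the cooked form would hand the regex
engine a class containing two separate characters, while the raw form
leaves the decision to the regex compiler.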

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN SMALL LETTER A}/'
    (?^u:\N{U+61})

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN CAPITAL LETTER A WITH MACRON}/'
    (?^u:\N{U+100})

    % perl -Mcharnames=:full -le 'print qr/\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}/'
    (?^u:\N{U+100.300})

So you can do:

    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ /\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}/'
    1

And it is just fine.  The issue is that there are ways for you to get
yourself into trouble if you do string-string stuff:

    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"'
    1
    % perl -Mcharnames=:full -le 'print "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}" =~ "^[\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}]+\$"'
    1

That works, but only accidentally, because of course U+0100.0300 contains
nothing but either U+0100 or U+0300.
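
The accident is easy to reproduce in Python by writing the two code
points of the sequence directly into a character class.  This is a sketch
with the sequence spelled out as a literal; SEQ and pattern are
illustrative names.

```python
import re

# Assumption: the named sequence, written out as its two code points.
SEQ = "\u0100\u0300"

# Inside brackets the sequence degrades into "U+0100 or U+0300".
pattern = re.compile("^[" + SEQ + "]+$")

print(bool(pattern.match(SEQ)))             # True, but only by accident
print(bool(pattern.match("\u0300\u0100")))  # True: wrong order still matches
print(bool(pattern.match("\u0100")))        # True: half the sequence matches
```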

This is not a solved problem.

I hope this helps.

--tom

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12753>
_______________________________________