Re: RFC: API to access Unicode db files

Karl Williamson Wed, 17 Aug 2011 13:42:37 -0700

Here's a new version of the API for comment, with the addition of 2 extra functions:



   prop_invlist()
       "prop_invlist" returns an inversion list (described below)
       that defines all the code points for the Unicode property
       given by the input parameter string:

        use Unicode::UCD 'prop_invlist';
        say join ", ", prop_invlist("Any");

        0, 1114112

       An empty list is returned if the given property is unknown;
       the number of elements in the list is returned if called in
       scalar context.

       perluniprops gives the list of properties that this function
       accepts, as well as all the possible forms for them (loose
       matching rules are used on the parameter).  Note that many
       properties can be specified in a compound form, such as

        say join ", ", prop_invlist("Script=Shavian");
        66640, 66688

        say join ", ", prop_invlist("ASCII_Hex_Digit=No");
        0, 48, 58, 65, 71, 97, 103

        say join ", ", prop_invlist("ASCII_Hex_Digit=Yes");
        48, 58, 65, 71, 97, 103

       Inversion lists are a compact way of specifying Unicode
       properties.  The 0th item in the list is the lowest code
       point that has the property-value.  The next item is the
       lowest code point after that one that does NOT have the
       property-value.  And the next item after that is the lowest
       code point after that one that has the property-value, and so
       on.  Put another way, each element in the list gives the
       beginning of a range that has the property-value (for even
       numbered elements), or doesn't have the property-value (for
       odd numbered elements).

       In the final example above, the first ASCII Hex digit is code
       point 48, the character "0", and all code points from it
       through 57 (a "9") are ASCII hex digits.  Code points 58
       through 64 aren't, but 65 (an "A") through 70 (an "F") are,
       as are 97 ("a") through 102 ("f").  103 starts a range of
       code points that aren't ASCII hex digits.  That range extends
       to infinity, which on your computer can be found in the
       variable $Unicode::UCD::MAX_CP.  (This variable is as close
       to infinity as Perl can get on your platform, and may be too
       high for some operations to work; you may wish to use a
       smaller number for your purposes.)

       The name for this data structure stems from the fact that
       each element in the list toggles (or inverts) whether the
       corresponding range is or isn't on the list.
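
       The toggling can be made concrete with a short sketch.  It is
       not part of the proposed API: "in_invlist" is a hypothetical
       helper, and the list is copied by hand from the
       "ASCII_Hex_Digit=Yes" output shown earlier, so that the code
       runs without Unicode::UCD.

```perl
use strict;
use warnings;

# The inversion list shown earlier for "ASCII_Hex_Digit=Yes".
my @invlist = (48, 58, 65, 71, 97, 103);

# in_invlist() is a hypothetical helper, not part of Unicode::UCD.
# It returns 1 if $cp is in the set described by @$list, else 0.
sub in_invlist {
    my ($list, $cp) = @_;
    my $in = 0;                  # Below the 0th element, nothing matches
    for my $start (@$list) {
        last if $cp < $start;    # $cp precedes this range's beginning
        $in = !$in;              # Each element toggles membership
    }
    return $in ? 1 : 0;
}

for my $char (qw(7 A g)) {
    printf "%s: %d\n", $char, in_invlist(\@invlist, ord $char);
}
```

       This linear scan is only meant to show the toggling; for long
       lists a binary search would be the more natural choice.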

       It is a simple matter to expand out an inversion list to a
       full list of all code points that have the property-value:

        my @invlist = prop_invlist("My Property");
        die "empty" unless @invlist;
        my @full_list;
        for (my $i = 0; $i < @invlist; $i += 2) {
           my $upper = ($i + 1) < @invlist
                       ? $invlist[$i+1] - 1      # In range
                       : $Unicode::UCD::MAX_CP;  # To infinity.  You may want
                                                 # to stop much much earlier;
                                                 # going this high may expose
                                                 # perl bugs with very large
                                                 # numbers.
           for my $j ($invlist[$i] .. $upper) {
               push @full_list, $j;
           }
        }

   prop_aliases()
           use Unicode::UCD 'prop_aliases';

           my $full_name = prop_aliases("White Space");
           my @all_names = prop_aliases("White Space");
           my $short_name = $all_names[0];
           print join(", ", @all_names), "\n";

           XXX

       Most Unicode properties have several synonymous names.
       Typically, there is at least a short name, convenient to
       type, and a long name that more fully describes the property,
       and hence is more easily understood.

       If you know one name for a property, you can use
       "prop_aliases" to find either the long name (when called in
       scalar context), or a list of all of the names, somewhat
       ordered so that the short name is in the 0th element, the
       long name in the next element, and any other synonyms in the
       remaining elements, in no particular order.

       The long name is returned in a form nicely capitalized,
       suitable for printing.

       White space, hyphens, and underscores are ignored in the
       input parameter name.

       If the name is unknown, "undef" is returned.

   prop_value_aliases()
           use Unicode::UCD 'prop_value_aliases';

           my $full_name = prop_value_aliases("Gc", "Punct");
           my @all_names = prop_value_aliases("Gc", "Punct");
           my $short_name = $all_names[0];
           print "The aliases are: ", join(", ", @all_names), "\n";
           print "The fullname is $full_name\n";

           The aliases are: P, Punctuation, Punct
           The fullname is Punctuation

       Some Unicode properties have a restricted set of legal
       values.  For example, all binary properties are restricted to
       just "true" or "false"; and there are only a few dozen
       possible General Categories.

       For such properties, there are usually several synonyms for
       each possible value.  For example, in binary properties,
       truth can be represented by any of the strings, "Y", "Yes",
       "T", or "True"; and the General Category "Punctuation" by
       that string, or "Punct", or simply "P".

       Like property names, there is typically at least a short name
       for each such property-value, and a long name.  If you know
       any name of the property-value, you can use
       "prop_value_aliases"() to get the long name (when called in
       scalar context), or a list of all the names, with the short
       name in the 0th element, the long name in the next element,
       and any other synonyms in the remaining elements, in no
       particular order, except that any all-numeric synonyms will
       be last.

       The long name is returned in a form nicely capitalized,
       suitable for printing.

       White space, hyphens, and underscores are ignored in the
       input parameters.

       If either name is unknown, "undef" is returned.

       If called with a property that doesn't have synonyms for its
       values, it returns the input value, possibly normalized with
       capitalization and underscores.

       For the block property, new-style block names are returned
       (see "Old-style versus new-style block names").

   prop_invmap()
       "prop_invmap" is used to get the complete mapping definition
       for a property, in the form of an inversion map.  An
       inversion map consists of two parallel arrays.  One is an
       ordered list of code points that mark range beginnings, and
       the other gives the value (or mapping) that all code points
       in the corresponding range have.

       "prop_invmap" is called with the name of the desired
       property.  The name is loosely matched, meaning that
       differences in case, white-space, hyphens, and underscores
       are not meaningful.  Many Unicode properties have more than
       one name (or alias).  "prop_invmap" understands all of these.
       "undef" is returned if the property name is unknown.

       It is a fatal error to call this function except in list
       context.
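
       The loose matching described above can be approximated by
       reducing each name to a comparison key.  This is only a
       sketch: "loose_key" is a hypothetical helper, not part of
       Unicode::UCD, and Unicode's full matching rules have further
       special cases that it does not attempt to cover.

```perl
use strict;
use warnings;

# loose_key() is a hypothetical helper, not part of Unicode::UCD.
sub loose_key {
    my ($name) = @_;
    my $key = lc $name;      # Case is not meaningful
    $key =~ s/[\s_-]+//g;    # Nor are white space, underscores, hyphens
    return $key;
}

# These spellings all reduce to the same key, "whitespace".
print loose_key($_), "\n" for "White Space", "white_space", "WHITE-SPACE";
```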

       In addition to the two arrays that form the inversion map,
       "prop_invmap" returns two other values: one is a scalar that
       gives some details as to the format of the entries of the map
       array; the other is used for specialized purposes, described
       at the end of this section.

       This means that "prop_invmap" returns a 4 element list.  For
       example,

        my ($blocks_ranges_ref, $blocks_maps_ref, $format, $default)
                                             = prop_invmap("Block");

       In this call, the two arrays will be populated as shown below
       (for Unicode 6.0):

        Index  @blocks_ranges  @blocks_maps
          0        0x0000      Basic Latin
          1        0x0080      Latin-1 Supplement
          2        0x0100      Latin Extended-A
          3        0x0180      Latin Extended-B
          4        0x0250      IPA Extensions
          5        0x02B0      Spacing Modifier Letters
          6        0x0300      Combining Diacritical Marks
          7        0x0370      Greek and Coptic
          8        0x0400      Cyrillic
         ...
        233        0x2B820     No_Block
        234        0x2F800     CJK Compatibility Ideographs Supplement
        235        0x2FA20     No_Block
        236        0xE0000     Tags
        237        0xE0080     No_Block
        238        0xE0100     Variation Selectors Supplement
        239        0xE01F0     No_Block
        240        0xF0000     Supplementary Private Use Area-A
        241        0x100000    Supplementary Private Use Area-B
        242        0x110000    No_Block

       The first line (with Index 0) means that the value for code
       point 0 is "Basic Latin".  The entry "0x0080" in the
       @blocks_ranges column in the second line means that the value
       from the first line, "Basic Latin", extends to all code
       points in the range up to but not including 0x0080, that is,
       to 127.  In other words, the code points from 0 to 127 are
       all in the "Basic Latin" block.  Similarly, all code points
       in the range from 0x0080 up to (but not including) 0x0100 are
       in the block named "Latin-1 Supplement", etc.  (Notice that
       the return is the old-style block names; see "Old-style
       versus new-style block names").

       The final line (with Index 242) means that all code points
       above the legal Unicode maximum code point have the value
       "No_Block", which is the term Unicode uses for a
       non-existing block.

       The arrays completely specify the mappings for all possible
       code points.  The final element in an inversion map returned
       by this function will always be for the range that consists
       of all the code points that aren't legal Unicode, but that
       are expressible on the platform.  (That is, it starts with
       code point 0x110000, the first code point above the legal
       Unicode maximum, and extends to infinity.)  The value for that
       range will be the same as the one that any normal unassigned
       code point has for the specified property.  (Certain unassigned code
       points are not "normal"; for example the non-character code
       points, or those in blocks that are to be written right-to-
       left.  The range value will not necessarily be the same as
       those code points have.)  It could be argued that, instead of
       treating these as unassigned Unicode code points, the value
       for this range should be "undef".  You can make that decision
       and change the returned array accordingly.
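
       If you do prefer "undef" for that final range, the change is
       small.  The sketch below assumes the two arrays have already
       been obtained; here they are filled by hand with the last
       three rows of the "Block" example so that the code runs on
       its own.

```perl
use strict;
use warnings;

# Tail of the "Block" example above, copied as literal data.
my @blocks_ranges = (0xF0000, 0x100000, 0x110000);
my @blocks_maps   = ("Supplementary Private Use Area-A",
                     "Supplementary Private Use Area-B",
                     "No_Block");

# If the last range starts just above the Unicode maximum, mark it undef.
$blocks_maps[-1] = undef
    if @blocks_ranges && $blocks_ranges[-1] == 0x110000;

printf "0x%X => %s\n", $blocks_ranges[$_], $blocks_maps[$_] // "undef"
    for 0 .. $#blocks_ranges;
```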

       The maps are almost always simple scalars that should be
       interpreted as-is.  These values are those given in the
       Unicode data files, which may be inconsistent as to
       capitalization and which synonym for a property-value is
       given.  The results may be normalized by using the
       "prop_value_aliases()" function.

       There are exceptions to the simple scalar maps.  Some
       properties have some elements in their map list that are
       themselves lists of scalars; and some special strings are
       returned that are not to be interpreted as-is.  Element [2]
       (placed into $format in the example above) of the returned 4
       element list tells you if the map has any of these special
       elements, as follows:

       "s" means all the elements of the map array are simple
           scalars.  Almost all properties are like this, like the
           "block" example above.

       "sl"
           means that some of the map array elements have the form
           given by "s", and the rest are lists of scalars.  For
           example, here is a portion of the output of calling
           "prop_invmap"() with the "Script Extensions" property:

            @scripts_ranges  @scripts_maps
                 ...
                 0x0953      Deva
                 0x0964      [ Beng Deva Guru Orya ]
                 0x0966      Deva
                 0x0970      Common

           Here, the code points 0x964 and 0x965 are used in the
           Bengali, Devanagari, Gurmukhi, and Oriya scripts.

       "r" means that all the elements of the map array are either
           rational numbers or the string "NaN", meaning "Not a
           Number".  A rational number is either an integer, or two
           integers separated by a solidus ("/").  The second
           integer represents the denominator of the division
           implied by the solidus, and is guaranteed not to be 0.
           If you want to convert them to scalar numbers, you can
           use something like this:

            my ($invlist_ref, $invmap_ref, $format)
                                    = prop_invmap($property);
            if ($format && $format eq "r") {
                for (@$invmap_ref) {
                    $_ = eval $_ unless $_ eq "NaN";   # Leave "NaN" as-is
                }
            }

           Here are some entries from the output of the property
           "Nv", which has format "r".

            @numerics_ranges  @numerics_maps        Note
                   0x00             "NaN"
                   0x30             0              DIGIT 0
                   0x31             1
                   0x32             2
                   ...
                   0x37             7
                   0x38             8
                   0x39             9              DIGIT 9
                   0x3A             "NaN"
                   0xB2             2              SUPERSCRIPT 2
                   0xB3             3              SUPERSCRIPT 3
                   0xB4             "NaN"
                   0xB9             1              SUPERSCRIPT 1
                   0xBA             "NaN"
                   0xBC             1/4            VULGAR FRACTION 1/4
                   0xBD             1/2            VULGAR FRACTION 1/2
                   0xBE             3/4            VULGAR FRACTION 3/4
                   0xBF             "NaN"
                   0x660            0          ARABIC-INDIC DIGIT ZERO

       "c" is like "s" in that all the map array elements are
           scalars, but some of them are the special string
           "<code point>", meaning that the map of each code point
           in the corresponding range in the inversion list is the
           code point itself.  For example, in:

            my ($uppers_ranges_ref, $uppers_maps_ref, $format)
                      = prop_invmap("Simple_Uppercase_Mapping");

           the returned arrays look like this:

            @$uppers_ranges_ref    @$uppers_maps_ref   Note
                  0                 "<code point>"
                 97                     65          'a' maps to 'A'
                 98                     66          'b' => 'B'
                 99                     67          'c' => 'C'
                 ...
                120                     88          'x' => 'X'
                121                     89          'y' => 'Y'
                122                     90          'z' => 'Z'
                123                "<code point>"
                181                    924          MICRO SIGN =>
                                                    Greek Cap MU
                182                "<code point>"
                ...

           The first line means that the uppercase of code point 0
           is 0; the uppercase of code point 1 is 1; ...  of code
           point 96 is 96.  Without the "<code point>" notation,
           every code point would have to have an entry.  This would
           mean that the arrays would each have more than a million
           entries to list just the legal Unicode code points!

       "cl"
           means that some of the map array elements have the form
           given by "c", and the rest are ordered lists of code
           points.  For example, in:

            my ($uppers_ranges_ref, $uppers_maps_ref, $format)
                               = prop_invmap("Uppercase_Mapping");

           the returned arrays look like this:

            @$uppers_ranges_ref    @$uppers_maps_ref       Note
                  0                 "<code point>"
                 97                     65
                ...
                122                     90
                123                "<code point>"
                181                    924
                182                "<code point>"
                ...
               0x0149              [ 0x02BC 0x004E ]

           This is the full Uppercase_Mapping property (as opposed
           to the Simple_Uppercase_Mapping given in the example for
           "c").  The only difference between the two in the ranges
           shown is that the code point at 0x0149 (LATIN SMALL
           LETTER N PRECEDED BY APOSTROPHE) maps to a string of two
           characters, 0x02BC (MODIFIER LETTER APOSTROPHE) followed
           by 0x004E (LATIN CAPITAL LETTER N).

       "n" means the Name property.  All the elements of the map
           array are simple scalars, but some of them contain
           special strings that require more work to get the actual
           name.

           Entries such as:

            CJK UNIFIED IDEOGRAPH-<code point>

           mean that the name for the code point is "CJK UNIFIED
           IDEOGRAPH-" with the code point (expressed in
           hexadecimal) appended to it (similarly for "CJK
           COMPATIBILITY IDEOGRAPH-<code point>").

           Also, entries like

            <hangul syllable>

           mean that the name is algorithmically calculated.  This
           is easily done by the function charnames::viacode().

           Note that for control characters ("Gc=cc"), Unicode's
           data files have the string "<control>", but the real name
           of each of these characters is the empty string.  This
           function returns the real name.

       "d" means the Decomposition_Mapping property.  Like "n", this
           property uses

            <hangul syllable>

           for those code points whose decomposition is
           algorithmically calculated.  These can be generated via
           the function Unicode::Normalize::NFD().

           Otherwise, this property is like "cl" properties.

           Note that the mapping is the one that is specified in the
           Unicode data files, and to get the final decomposition,
           it may need to be applied recursively.
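
       As an illustration of the "c", "cl" and "n" formats described
       above, the following sketch expands the "<code point>"
       placeholder for a single map entry.  "resolve_entry" is a
       hypothetical helper, not part of Unicode::UCD, and it assumes
       you already know which range the code point falls in.

```perl
use strict;
use warnings;

# resolve_entry() is a hypothetical helper, not part of Unicode::UCD.
# Given a format, one map entry, and a code point in that entry's
# range, it returns the value for that code point.
sub resolve_entry {
    my ($format, $entry, $cp) = @_;
    if ($format eq "c" || $format eq "cl") {
        # The special string means the code point maps to itself.
        return $cp if !ref $entry && $entry eq "<code point>";
    }
    elsif ($format eq "n") {
        # Append the code point in hexadecimal where the entry asks for
        # it.  Other entries, such as "<hangul syllable>", come back
        # unchanged and still need charnames::viacode().
        my $hex = sprintf "%04X", $cp;
        (my $name = $entry) =~ s/<code point>/$hex/;
        return $name;
    }
    return $entry;    # Anything else is returned unchanged
}

print resolve_entry("c", "<code point>", 96), "\n";
print resolve_entry("c", 65, 97), "\n";
print resolve_entry("n", "CJK UNIFIED IDEOGRAPH-<code point>", 0x4E00), "\n";
```

       The three calls print 96, 65, and "CJK UNIFIED
       IDEOGRAPH-4E00" respectively.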

       A binary search can be used to quickly find a code point in
       the inversion list, and hence its corresponding mapping.
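
       One way such a binary search might look is sketched below.
       "search_invmap" is a hypothetical helper, not part of
       Unicode::UCD, and the arrays hold only the first five rows of
       the "Block" example, copied by hand.  With that truncated
       data, any code point at or above 0x0250 is reported as "IPA
       Extensions"; with the full arrays the answer would be exact.

```perl
use strict;
use warnings;

# First rows of the "Block" example above, copied as literal data.
my @ranges = (0x0000, 0x0080, 0x0100, 0x0180, 0x0250);
my @maps   = ("Basic Latin", "Latin-1 Supplement", "Latin Extended-A",
              "Latin Extended-B", "IPA Extensions");

# search_invmap() is a hypothetical helper, not part of Unicode::UCD.
# It returns the index of the last range whose start is <= $cp, or -1.
sub search_invmap {
    my ($ranges, $cp) = @_;
    my ($lo, $hi) = (0, $#$ranges);
    my $found = -1;
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($ranges->[$mid] <= $cp) {
            $found = $mid;       # Candidate; a later range may also fit
            $lo = $mid + 1;
        }
        else {
            $hi = $mid - 1;
        }
    }
    return $found;
}

my $i = search_invmap(\@ranges, 0x00E9);
print "U+00E9 is in: $maps[$i]\n";
```

       Because the two arrays are parallel, the index found in the
       range array is also the index of the mapping.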

       The final element ([3], assigned to $default in the "block"
       example) in the list returned by this function may be useful
       for applications that wish to convert the returned inversion
       map data structure into some other, such as a hash.  It gives
       the mapping that most code points map to under the property.
       If you establish the convention that any code point not
       explicitly listed in your data structure maps to this value,
       you can potentially make your data structure much smaller.
       As you construct your data structure from the one returned by
       this function, simply ignore those ranges that map to this
       value, generally called the "default" value.
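
       A conversion to a hash along those lines might look like the
       sketch below.  The data is a hand-copied slice of the "Nv"
       example shown earlier, with "NaN" taken as the default; the
       names %value_of and $default are only illustrative, and the
       code assumes every map entry is a simple scalar.

```perl
use strict;
use warnings;

# A slice of the "Nv" example shown earlier, copied as literal data.
my @ranges  = (0x3A, 0xB2, 0xB3, 0xB4, 0xB9, 0xBA, 0xBC, 0xBD, 0xBE, 0xBF);
my @maps    = ("NaN", 2, 3, "NaN", 1, "NaN", "1/4", "1/2", "3/4", "NaN");
my $default = "NaN";

my %value_of;    # code point => value, for non-default code points only
for my $i (0 .. $#ranges) {
    next if $maps[$i] eq $default;          # Skip ranges at the default
    die "final range is not the default" if $i == $#ranges;
    my $last = $ranges[$i + 1] - 1;         # Range ends before the next one
    $value_of{$_} = $maps[$i] for $ranges[$i] .. $last;
}

# Any code point missing from the hash maps to the default.
for my $cp (0xB2, 0xBD, 0x41) {
    printf "U+%04X => %s\n", $cp, $value_of{$cp} // $default;
}
```

       Here the hash ends up with six entries instead of one per
       code point in the slice.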

       One internal Perl property is accessible by this function.
       "Perl_Decimal_Digit" returns an inversion map in which all
       the Unicode decimal digits map to their numeric values, and
       everything else to the empty string, like so:

        @digits    @values
        0x0000       ""
        0x0030       0
        0x0031       1
        0x0032       2
        0x0033       3
        0x0034       4
        0x0035       5
        0x0036       6
        0x0037       7
        0x0038       8
        0x0039       9
        0x003A       ""
        0x0660       0
        0x0661       1
        ...

   Old-style versus new-style block names
       Unicode publishes the names of blocks in two different
       styles, though the two are equivalent under Unicode's loose
       matching rules.

       The original style uses blanks and hyphens in the block names
       (except for "No_Block"), like so:

        Miscellaneous Mathematical Symbols-B

       The newer style replaces these with underscores, like this:

        Miscellaneous_Mathematical_Symbols_B

       This newer style is consistent with the values of other
       Unicode properties.  To preserve backward compatibility, all
       the functions in Unicode::UCD that return block names (except
       one) return the old-style ones.  That one function,
       "prop_value_aliases"() can be used to convert from old-style
       to new-style:

        my $new_style = prop_value_aliases("block", $old_style);
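
       The relationship between the two styles is simple enough to
       show with a substitution.  This is only an illustration of
       the naming rule described above, not a replacement for
       "prop_value_aliases"; "old_to_new_style" is a hypothetical
       helper, not part of Unicode::UCD.

```perl
use strict;
use warnings;

# old_to_new_style() is a hypothetical helper, not part of Unicode::UCD.
sub old_to_new_style {
    my ($old_style) = @_;
    (my $new_style = $old_style) =~ s/[ -]/_/g;   # Blanks, hyphens => "_"
    return $new_style;
}

print old_to_new_style("Miscellaneous Mathematical Symbols-B"), "\n";
```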
