Re: [bug-libunistring] Script APIs

Bruno Haible Mon, 28 Mar 2011 18:01:56 -0700

Hi Simon,

> there aren't any enum's for the various scripts, so my code ends up
> looking like this:
> 
>     case 0x30FB:
>       {
>       /* KATAKANA MIDDLE DOT */
>       size_t i;
>       bool script_ok = false;
> 
>       for (i = 0; !script_ok && i < llen; i++)
>         if (strcmp (uc_script (label[i])->name, "Hiragana") == 0
>             || strcmp (uc_script (label[i])->name, "Katakana") == 0
>             || strcmp (uc_script (label[i])->name, "Han") == 0)
>           script_ok = true;


I would write this code as

     case 0x30FB:
       {
       /* KATAKANA MIDDLE DOT */
       size_t i;
       const uc_script_t *hiragana_script = uc_script_byname ("Hiragana");
       const uc_script_t *katakana_script = uc_script_byname ("Katakana");
       const uc_script_t *han_script = uc_script_byname ("Han");
       bool script_ok = false;

       for (i = 0; !script_ok && i < llen; i++)
         if (uc_is_script (label[i], hiragana_script)
             || uc_is_script (label[i], katakana_script)
             || uc_is_script (label[i], han_script))
           script_ok = true;

> I think the fundamental
> issue is the use of strings for describing scripts.

I don't think there is an issue. You can use 'const uc_script_t *' values the
way you would use enumerated values; the only exception is that you cannot use
them in static initializers and must initialize them at run time instead.
The reason is that
  1) scripts contain more information than just a name,
  2) the set of scripts is increasing with every Unicode release.
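
A minimal sketch of that run-time initialization pattern follows. It is not
libunistring code: 'script_t', 'script_byname', 'init_scripts' and
'is_cjk_script' are hypothetical stand-ins so that the example compiles on its
own. With the real library the lookup would be uc_script_byname from
<unictype.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a script descriptor; not the libunistring type.  */
typedef struct { const char *name; } script_t;

static const script_t script_table[] = {
  { "Han" }, { "Hiragana" }, { "Katakana" }, { "Latin" }
};

/* Hypothetical stand-in for a by-name lookup; returns NULL if unknown.  */
static const script_t *
script_byname (const char *name)
{
  size_t i;

  for (i = 0; i < sizeof script_table / sizeof script_table[0]; i++)
    if (strcmp (script_table[i].name, name) == 0)
      return &script_table[i];
  return NULL;
}

/* These handles cannot be given a static initializer, because the lookup
   is a function call.  They are filled in on first use instead.  */
static const script_t *hiragana_script;
static const script_t *katakana_script;
static const script_t *han_script;

static void
init_scripts (void)
{
  if (han_script == NULL)
    {
      hiragana_script = script_byname ("Hiragana");
      katakana_script = script_byname ("Katakana");
      han_script = script_byname ("Han");
      assert (hiragana_script != NULL);
      assert (katakana_script != NULL);
      assert (han_script != NULL);
    }
}

/* Once initialized, the handles compare like enumerated values.  */
static int
is_cjk_script (const script_t *s)
{
  init_scripts ();
  return s == hiragana_script || s == katakana_script || s == han_script;
}
```

The string comparisons happen once, in init_scripts; every later test is a
plain pointer comparison.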

> The API for finding out the script of a code point is this:
> 
> http://www.gnu.org/software/libunistring/manual/libunistring.html#Scripts

Note: I'm in the middle of redesigning this API, and the blocks API.
The existing API is not extensible and costs a lot of relocations at startup
time. My current draft for the new API is this:

typedef struct
{
  unsigned int code : 21;
  unsigned int start : 1;
  unsigned int end : 1;
}
uc_interval_t;

typedef struct uc_script *uc_script_t;

/* Return the script of a Unicode character.  */
extern uc_script_t
       uc_script (ucs4_t uc);

/* Return the name of a script.  */
extern const char *
       uc_script_name (uc_script_t script);

/* Return the abbreviated name (ISO 15924) of a script.  */
extern const char *
       uc_script_iso_name (uc_script_t script);

/* Return the code (ISO 15924) of a script.  */
extern int
       uc_script_iso_code (uc_script_t script);

/* Return the intervals of a script.
   *NINTERVALS is set to the number of intervals.
   *INTERVALS is set to a pointer to *NINTERVALS intervals.  */
extern void
       uc_script_intervals (uc_script_t script,
                            const uc_interval_t **intervals,
                            size_t *nintervals);

/* Return the script given by name, e.g. "HAN", or abbreviated name
   (ISO 15924).  */
extern uc_script_t
       uc_script_byname (const char *script_name);

/* Return the script given by code (ISO 15924).  */
extern uc_script_t
       uc_script_bycode (int script_code);

/* Test whether a Unicode character belongs to a given script.  */
extern bool
       uc_is_script (ucs4_t uc, uc_script_t script);

/* Get the list of all scripts.
   *COUNT is set to the number of scripts.
   *SCRIPTS is set to a pointer to *COUNT scripts.  */
extern void
       uc_all_scripts (uc_script_t **scripts, size_t *count);
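
The draft does not spell out how the 'start' and 'end' bits of uc_interval_t
are meant to be read. One plausible reading, assumed here, is that the entries
are sorted by code, that an entry with start=1 opens an interval, that an entry
with end=1 closes it, and that a single-character interval is one entry with
both bits set. Under that assumption, a membership test over the array
returned by uc_script_intervals could be sketched as below. The helper
'intervals_contain' and the sample table are hypothetical and for illustration
only.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the draft typedef above.  */
typedef struct
{
  unsigned int code : 21;
  unsigned int start : 1;
  unsigned int end : 1;
}
uc_interval_t;

/* Illustrative data only: two ranges and one single-character interval.  */
static const uc_interval_t sample_intervals[] = {
  { 0x3041, 1, 0 }, { 0x3096, 0, 1 },
  { 0x309D, 1, 0 }, { 0x309F, 0, 1 },
  { 0x1F200, 1, 1 }
};

/* Hypothetical helper, not part of the draft API.
   Return 1 if UC lies in one of the closed intervals described by the
   N entries at ENTRIES, assuming the encoding described above.  */
static int
intervals_contain (const uc_interval_t *entries, size_t n, unsigned int uc)
{
  size_t i;
  int inside = 0;

  for (i = 0; i < n; i++)
    {
      if (entries[i].code > uc)
        break;
      /* Here entries[i].code <= uc.  */
      if (entries[i].start)
        inside = 1;
      /* An interval that ended strictly before UC does not contain it.  */
      if (entries[i].end && entries[i].code < uc)
        inside = 0;
    }
  return inside;
}
```

The scan is linear to keep the sketch short; since the entries are assumed to
be sorted, a binary search would work as well.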


> Or is there any particular reason an enum for each script isn't
> used?  Like what is used for joining types and combining classes (for
> example).

You're right: Since the user cannot allocate scripts by himself, particular
values of 'uc_script_t' can also be assigned to specific integers. I think
I can implement your suggestion more easily with the new API than with the
old one.
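
To illustrate what integer-valued scripts would allow, here is a minimal
sketch. Every name in it ('enum my_script', 'my_script_of',
'label_script_ok', the MY_SCRIPT_* constants) is hypothetical and not a
libunistring identifier, and the classifier covers only a few rough ranges so
that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enumeration; these names are not libunistring identifiers.  */
enum my_script
{
  MY_SCRIPT_UNKNOWN,
  MY_SCRIPT_HAN,
  MY_SCRIPT_HIRAGANA,
  MY_SCRIPT_KATAKANA
};

/* Unlike pointers obtained from a lookup function, integer constants can
   appear in a static initializer.  */
static const enum my_script middle_dot_context[] =
  { MY_SCRIPT_HIRAGANA, MY_SCRIPT_KATAKANA, MY_SCRIPT_HAN };

/* Toy classifier covering a few rough ranges, for illustration only.  */
static enum my_script
my_script_of (unsigned int uc)
{
  if (uc >= 0x3041 && uc <= 0x3096)
    return MY_SCRIPT_HIRAGANA;
  if (uc >= 0x30A1 && uc <= 0x30FA)
    return MY_SCRIPT_KATAKANA;
  if (uc >= 0x4E00 && uc <= 0x9FFF)
    return MY_SCRIPT_HAN;
  return MY_SCRIPT_UNKNOWN;
}

/* Return 1 if some character of LABEL is in one of the listed scripts.  */
static int
label_script_ok (const unsigned int *label, size_t llen)
{
  size_t n = sizeof middle_dot_context / sizeof middle_dot_context[0];
  size_t i, j;

  for (i = 0; i < llen; i++)
    for (j = 0; j < n; j++)
      if (my_script_of (label[i]) == middle_dot_context[j])
        return 1;
  return 0;
}
```

Integer values could also be used as 'case' labels in a switch statement,
which pointer values cannot.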

Bruno
-- 
In memoriam Georg Lehnig <http://de.wikipedia.org/wiki/Georg_Lehnig>
