Hi Simon,

> there aren't any enum's for the various scripts, so my code ends up
> looking like this:
>
>     case 0x30FB:
>       {
>         /* KATAKANA MIDDLE DOT */
>         size_t i;
>         bool script_ok = false;
>
>         for (i = 0; !script_ok && i < llen; i++)
>           if (strcmp (uc_script (label[i])->name, "Hiragana") == 0
>               || strcmp (uc_script (label[i])->name, "Katakana") == 0
>               || strcmp (uc_script (label[i])->name, "Han") == 0)
>             script_ok = true;

I would write this code as

    case 0x30FB:
      {
        /* KATAKANA MIDDLE DOT */
        size_t i;
        const uc_script_t *hiragana_script = uc_script_byname ("Hiragana");
        const uc_script_t *katakana_script = uc_script_byname ("Katakana");
        const uc_script_t *han_script = uc_script_byname ("Han");
        bool script_ok = false;

        for (i = 0; !script_ok && i < llen; i++)
          if (uc_is_script (label[i], hiragana_script)
              || uc_is_script (label[i], katakana_script)
              || uc_is_script (label[i], han_script))
            script_ok = true;

> I think the fundamental
> issue is the use of strings for describing scripts.

I think there is no issue. You can use 'const uc_script_t *' values like
you would use enumerated values, with the only exception that you cannot
use them in static initializers and have to use initialization at runtime
instead. And this is due to the fact that
  1) scripts contain more information than just a name,
  2) the set of scripts is increasing with every Unicode release.

> The APIs for finding out the script of a code point is this:
>
> http://www.gnu.org/software/libunistring/manual/libunistring.html#Scripts

Attention: I'm just redesigning this API, and the blocks API, right now.
The existing API is not extensible and costs a lot of relocations at
startup time. My current draft for the new API is this:

typedef struct
{
  unsigned int code : 21;
  unsigned int start : 1;
  unsigned int end : 1;
}
uc_interval_t;

typedef struct uc_script *uc_script_t;

/* Return the script of a Unicode character.  */
extern uc_script_t
       uc_script (ucs4_t uc);

/* Return the name of a script.  */
extern const char *
       uc_script_name (uc_script_t script);

/* Return the abbreviated name (ISO 15924) of a script.  */
extern const char *
       uc_script_iso_name (uc_script_t script);

/* Return the code (ISO 15924) of a script.  */
extern int
       uc_script_iso_code (uc_script_t script);

/* Return the intervals of a script.
   *NINTERVALS is set to the number of intervals.
   *INTERVALS is set to a pointer to *NINTERVALS intervals.  */
extern void
       uc_script_intervals (uc_script_t script,
                            const uc_interval_t **intervals,
                            size_t *nintervals);

/* Return the script given by name, e.g. "HAN", or abbreviated name
   (ISO 15924).  */
extern uc_script_t
       uc_script_byname (const char *script_name);

/* Return the script given by code (ISO 15924).  */
extern uc_script_t
       uc_script_bycode (int script_code);

/* Test whether a Unicode character belongs to a given script.  */
extern bool
       uc_is_script (ucs4_t uc, uc_script_t script);

/* Get the list of all scripts.
   *COUNT is set to the number of scripts.
   *SCRIPTS is set to a pointer to *COUNT scripts.  */
extern void
       uc_all_scripts (uc_script_t **scripts, size_t *count);

> Or is there any particular reason there an enum for each script isn't
> used?  Like what is used for joining types and combining classes (for
> example).

You're right: Since the user cannot allocate scripts by himself, particular
values of 'uc_script_t' can also be assigned to specific integers. I think
I can implement your suggestion more easily with the new API than with the
old one.

Bruno
-- 
In memoriam Georg Lehnig <http://de.wikipedia.org/wiki/Georg_Lehnig>
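
For illustration, the point about using script handles like enumerated
values can be sketched in a self-contained way. This is not libunistring
code: the toy_script_t type, the toy_* functions, and the single code point
range per script are all assumptions made up for the example (real scripts
cover many ranges and carry more data). The sketch only shows handles being
looked up by name once, at run time, and then reused without any further
string comparisons.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a script descriptor; NOT libunistring's type.  */
typedef struct
{
  const char *name;
  unsigned int first;   /* first code point of one simplified range */
  unsigned int last;    /* last code point of that range */
}
toy_script_t;

/* Simplified toy table: one range per script, for illustration only.  */
static const toy_script_t toy_scripts[] =
{
  { "Hiragana", 0x3041, 0x3096 },
  { "Katakana", 0x30A1, 0x30FA },
  { "Han",      0x4E00, 0x9FFF }
};

/* Return the handle of the script with the given name, or NULL.  */
static const toy_script_t *
toy_script_byname (const char *name)
{
  size_t i;

  for (i = 0; i < sizeof (toy_scripts) / sizeof (toy_scripts[0]); i++)
    if (strcmp (toy_scripts[i].name, name) == 0)
      return &toy_scripts[i];
  return NULL;
}

/* Test whether code point UC lies in the toy range of SCRIPT.  */
static bool
toy_is_script (unsigned int uc, const toy_script_t *script)
{
  return script != NULL && uc >= script->first && uc <= script->last;
}

/* Return true if any of LABEL[0..LLEN-1] is Hiragana, Katakana or Han.  */
static bool
label_has_cjk_script (const unsigned int *label, size_t llen)
{
  /* The handles cannot be static initializers, so they are looked up on
     first use.  This simple cache is not thread-safe.  */
  static const toy_script_t *hiragana, *katakana, *han;
  static bool initialized = false;
  size_t i;

  if (!initialized)
    {
      hiragana = toy_script_byname ("Hiragana");
      katakana = toy_script_byname ("Katakana");
      han = toy_script_byname ("Han");
      initialized = true;
    }

  for (i = 0; i < llen; i++)
    if (toy_is_script (label[i], hiragana)
        || toy_is_script (label[i], katakana)
        || toy_is_script (label[i], han))
      return true;
  return false;
}
```

Because each name always maps to the same handle, two handles can be
compared with == just like enumerated values; only the initial lookup
involves strings.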