Glenn,

> Sounds like it might be fairly big, too, which
> is annoying for most small- and medium-sized projects--most OSS
> developers are very hesitant to add a major dependency, especially the
> cross-platform ones (where this will often mean shipping binary packages
> along with the runtimes for the dependency.)

The UTF-8 support routines are very small, and there are actually only a few
of them.  For example:

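/* UTF-8 strpbrk: return a pointer to the first character in string that
   also occurs in charlst, or NULL if there is none.  On bad UTF-8 the
   thread status is set and the current position in string is returned. */
UChar8 * xiu8_strpbrk(UChar8 *string,
                        UChar8 * charlst)
{
        XIUA_Thread * curr_thread;
        UChar8 * ptr1;
        UChar8 * ptr2;
        int i, j;

        curr_thread = xiux_getCurrentThread();
        ptr1 = string;
        while (*ptr1)
        {
                /* Compare the current character against each one in charlst */
                ptr2 = charlst;
                while (*ptr2)
                {
                        /* Byte length of the charlst character, from its lead byte */
                        i = xbytesFromUTF8[*ptr2];
                        if (i < 1)
                        {
                                /* Invalid lead byte in charlst */
                                curr_thread->status = U_INVALID_CHAR_FOUND;
                                return ptr1;
                        }
                        for (j=0;j<i;j++)
                        {
                                if (ptr1[j] != ptr2[j]) break;
                        }
                        /* All i bytes matched: found a character from charlst */
                        if (j == i)     return ptr1;
                        ptr2 += i;
                }
                /* No match: step over the current character in string */
                i = xbytesFromUTF8[*ptr1];
                if (i < 1)
                {
                        /* Invalid lead byte in string */
                        curr_thread->status = U_INVALID_CHAR_FOUND;
                        return ptr1;
                }
                ptr1 += i;
        }
        return(NULL);
}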

You will notice that finding the length of a UTF-8 character is a simple
table lookup (very fast, and the table is only 256 bytes).  This routine
reports an error if it is passed bad UTF-8 data.  This is important because
with plain single-byte character data there are no bad codes, whereas UTF-8
input can be malformed.
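
To make the table lookup idea concrete, here is a rough sketch of how such a
256-entry table could be filled in.  This is not the actual xbytesFromUTF8
table; the names demo_bytes_from_utf8, demo_init_table and demo_char_len are
made up for illustration, and the byte ranges assume the RFC 3629 form of
UTF-8 (sequences of at most 4 bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lead-byte length table (assumption, not the
   real xbytesFromUTF8): entry is the sequence length in bytes, or 0 for a
   byte that cannot start a UTF-8 sequence. */
static signed char demo_bytes_from_utf8[256];

/* Fill the table once at startup. */
static void demo_init_table(void)
{
    int b;
    for (b = 0; b < 256; b++) {
        if (b < 0x80)
            demo_bytes_from_utf8[b] = 1;   /* ASCII */
        else if (b < 0xC2)
            demo_bytes_from_utf8[b] = 0;   /* continuation bytes, overlong leads C0/C1 */
        else if (b < 0xE0)
            demo_bytes_from_utf8[b] = 2;   /* 2-byte lead */
        else if (b < 0xF0)
            demo_bytes_from_utf8[b] = 3;   /* 3-byte lead */
        else if (b < 0xF5)
            demo_bytes_from_utf8[b] = 4;   /* 4-byte lead */
        else
            demo_bytes_from_utf8[b] = 0;   /* F5..FF never appear in UTF-8 */
    }
}

/* Length of the character starting at p, or 0 if p is not a valid lead byte.
   demo_init_table() must have been called first. */
static int demo_char_len(const unsigned char *p)
{
    assert(p != NULL);
    return demo_bytes_from_utf8[*p];
}
```

A caller would run demo_init_table() once and then treat a result below 1
from demo_char_len() as bad data, the same way the routine above tests
i < 1.  Note that the table only validates the lead byte; checking the
continuation bytes would need extra code.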

This specific code uses thread-local storage to save the error code.  You
can cut some of the overhead by implementing a non-thread-safe version.
When this code is compiled without thread support, it uses a static pointer
to the same structure, so you have one extra level of indirection in order
to keep the code the same.
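
The two build variants described above might look roughly like this.  It is
only a sketch under assumptions: Demo_Thread, DEMO_THREADS,
DEMO_INVALID_CHAR and the demo_ functions are hypothetical names, the real
XIUA_Thread layout is not shown in this mail, and the threaded branch
assumes a C11 compiler with _Thread_local.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status record (assumption; stands in for XIUA_Thread). */
typedef struct {
    int status;
} Demo_Thread;

/* Placeholder error value for illustration only. */
#define DEMO_INVALID_CHAR 10

#ifdef DEMO_THREADS
/* Threaded build: each thread gets its own record (C11 thread-local). */
static _Thread_local Demo_Thread demo_state;

static Demo_Thread *demo_get_current_thread(void)
{
    return &demo_state;
}
#else
/* Non-threaded build: one static record reached through a static pointer,
   so callers still go through the same accessor and pointer type. */
static Demo_Thread demo_state;
static Demo_Thread *demo_current = &demo_state;

static Demo_Thread *demo_get_current_thread(void)
{
    return demo_current;
}
#endif

/* Caller code is identical in both builds. */
static void demo_report_bad_char(void)
{
    Demo_Thread *t = demo_get_current_thread();
    assert(t != NULL);
    t->status = DEMO_INVALID_CHAR;
}
```

The point of the design is that routines such as the one above never need
an #ifdef of their own: they always fetch a pointer and write through it,
and only the accessor changes between builds.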

Carl


-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/