Hi, All,

I want to share some of my thoughts about string encoding.

Personally I like the UTF-8 encoding. The variable-length
problem can be handled by a special (virtual) function
like

class String {
    virtual UV iterate(/*inout*/ int* index);
};

So in typical string iteration, the code will look like
    for (i = 0; i < size;) {
        UV ch = s->iterate(&i);
        /* do what u want */
    }
instead of
    for (i = 0; i < size; i++) {
        uint32 ch = s->charAt(i);
        /* be my guest */
    }

The new style may look strange, but it is not very difficult to
use. It also hides the internal representation.
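
To make the idea concrete, here is a rough sketch of how a UTF-8 backed
string could implement iterate(). It is only an illustration under my own
assumptions: UV is taken to be a 32-bit unsigned type, the names Utf8String,
size() and codePoints() are made up for the example, and I marked iterate()
const. It does not reject overlong forms or surrogates; malformed or
truncated input just yields U+FFFD.

```cpp
#include <cstdint>
#include <string>
#include <vector>

typedef std::uint32_t UV;  // assumption: wide enough for any code point

class String {
public:
    virtual ~String() {}
    virtual int size() const = 0;  // length in storage units (bytes here)
    // Returns the code point starting at *index and advances *index past it.
    virtual UV iterate(/*inout*/ int* index) const = 0;
};

// Hypothetical UTF-8 implementation; callers never see the byte layout.
class Utf8String : public String {
public:
    explicit Utf8String(const std::string& bytes) : bytes_(bytes) {}

    int size() const { return static_cast<int>(bytes_.size()); }

    UV iterate(/*inout*/ int* index) const {
        unsigned char lead = static_cast<unsigned char>(bytes_[*index]);
        int extra;  // number of continuation bytes that follow the lead byte
        if (lead < 0x80)               extra = 0;  // 0xxxxxxx
        else if ((lead >> 5) == 0x06)  extra = 1;  // 110xxxxx
        else if ((lead >> 4) == 0x0E)  extra = 2;  // 1110xxxx
        else if ((lead >> 3) == 0x1E)  extra = 3;  // 11110xxx
        else { *index += 1; return 0xFFFD; }       // stray or invalid lead byte

        if (*index + extra >= size()) {            // sequence runs off the end
            *index = size();
            return 0xFFFD;
        }

        // Keep the payload bits of the lead byte, then append 6 bits per
        // continuation byte.
        UV ch = (extra == 0) ? lead : (lead & (0x3F >> extra));
        for (int k = 1; k <= extra; ++k) {
            unsigned char cont = static_cast<unsigned char>(bytes_[*index + k]);
            ch = (ch << 6) | (cont & 0x3F);
        }
        *index += extra + 1;
        return ch;
    }

private:
    std::string bytes_;
};

// The loop from the post, written against the abstract interface.
std::vector<UV> codePoints(const String& s) {
    std::vector<UV> out;
    for (int i = 0; i < s.size();) {
        out.push_back(s.iterate(&i));
    }
    return out;
}
```

The caller's loop never changes if the storage changes: a fixed-width
implementation would simply advance the index by one each time.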

The UTF-32 suggestion largely ignores internationalization.
Many user characters are composed of more than one Unicode code
point. Once you consider Unicode normalization, canonical forms,
Hangul conjoining jamo, Indic clusters, combining characters, virama,
collation, and locale, UTF-32 will not help you much, if at all.
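
A tiny sketch of the problem, under a deliberately simplified rule: it
treats only the Combining Diacritical Marks block (U+0300..U+036F) as
"attaches to the previous character". The helper names are made up for the
example, and real segmentation (Hangul, Indic clusters and so on) needs far
more than this. Even so, it shows that indexing UTF-32 units does not give
you user characters.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true only for U+0300..U+036F. Real Unicode has many
// more combining characters than this single block.
static bool isCombiningMark(std::uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Counts user characters under the toy rule above: a combining mark that
// follows another code point does not start a new character.
static std::size_t countUserCharacters(const std::vector<std::uint32_t>& utf32) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf32.size(); ++i) {
        if (i == 0 || !isCombiningMark(utf32[i])) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-32 sequence U+0065 U+0301 U+0078 ("e" plus a combining acute
accent, then "x"), the vector holds three code points but the user sees two
characters, so charAt(1) in UTF-32 lands in the middle of one.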

Hong
