Re: string encoding

Branden Fri, 16 Feb 2001 04:13:50 -0800
Simon Cozens wrote:
> On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote:
> > The concept of characters have nothing to do with codepoints.
> > Many characters are composed by more than one codepoints.
>
> This isn't true.
>

Yes, for UTF-16 it is. For UTF-32 it isn't, but unless you want to read a
100KB UTF-8 file that contains a single character needing more than 16 bits
and end up using 400KB of memory for it, I think UTF-32 is not what should
be used for the general case.
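
To put rough numbers on that trade-off, here is a minimal sketch that
encodes the same text three ways and compares the byte counts. The sample
text and the helper name encoded_sizes are made up for illustration; the
little-endian codecs are used only so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded size in bytes of `text` for three encodings.

    Hypothetical helper for illustration. The "-le" codecs are chosen
    so that no byte-order mark is added to the totals.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Mostly-ASCII sample text plus one character that needs more than 16 bits:
# U+1D11E MUSICAL SYMBOL G CLEF (4 bytes in UTF-8, a surrogate pair in UTF-16).
sample = "a" * 99_996 + "\U0001D11E"

sizes = encoded_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
# utf-8: 100000 bytes
# utf-16: 199996 bytes
# utf-32: 399988 bytes
```

For this input the fixed-width UTF-32 form is about four times the size of
the UTF-8 original, and UTF-16 about twice.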

And sometimes I don't need random access to the string. I only need to do
some pattern matches (which AFAIK go sequentially) and maybe print the
string. A variable-width character encoding is a better solution to various
problems, even though it has drawbacks such as sequential-only access. I
don't see any problem in having it both ways (or even other ways, like
indexes, trees, and such things on strings) and having an abstract API to
deal with them transparently, like Hong suggested.
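
A rough sketch of what such an abstract API could look like, with one
variable-width and one fixed-width backend behind the same interface. All
of the names here (EncodedString, Utf8String, Utf32String, count_matches)
are hypothetical and not taken from any actual design, and "character" is
simplified to mean one code point.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class EncodedString(ABC):
    """Hypothetical abstract string; callers never see the storage format."""

    @abstractmethod
    def chars(self) -> Iterator[str]:
        """Yield characters (code points, as an assumption) in order."""

    @abstractmethod
    def char_at(self, index: int) -> str:
        """Return the character at position `index`."""


class Utf8String(EncodedString):
    """Variable-width storage: compact, but indexing means scanning."""

    def __init__(self, text: str) -> None:
        self._data = text.encode("utf-8")

    def chars(self) -> Iterator[str]:
        i = 0
        while i < len(self._data):
            lead = self._data[i]
            # The lead byte tells how many bytes this character occupies.
            width = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
            yield self._data[i:i + width].decode("utf-8")
            i += width

    def char_at(self, index: int) -> str:
        # No random access: walk from the start until we reach `index`.
        for pos, ch in enumerate(self.chars()):
            if pos == index:
                return ch
        raise IndexError(index)


class Utf32String(EncodedString):
    """Fixed-width storage: four bytes per character, O(1) indexing."""

    def __init__(self, text: str) -> None:
        self._data = text.encode("utf-32-le")

    def chars(self) -> Iterator[str]:
        for i in range(0, len(self._data), 4):
            yield self._data[i:i + 4].decode("utf-32-le")

    def char_at(self, index: int) -> str:
        if not 0 <= index < len(self._data) // 4:
            raise IndexError(index)
        return self._data[4 * index:4 * index + 4].decode("utf-32-le")


def count_matches(s: EncodedString, target: str) -> int:
    """Sequential scan that works the same on either backend."""
    return sum(1 for ch in s.chars() if ch == target)
```

Code like count_matches only ever walks the string, so it pays nothing for
the compact variable-width form; only callers that need char_at at arbitrary
positions would have a reason to prefer the fixed-width backend.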

- Branden