Dan Sugalski wrote:
> At 05:09 PM 2/15/2001 -0800, Hong Zhang wrote:
> >People in Japan/China/Korea have been using multi-byte encoding for a
> >long time. I personally have used it for more than 10 years. I never feel
> >much of the "pain". Do you think I am using my computer with O(n)
> >while you are using it with O(1)? There are 100 million people using
> >variable-length encoding!!!
>
> Not at this level they aren't. The people actually writing the code do
> feel the pain, and you do pay a computational price. You can't *not* pay
> the price.
>
>    substr($foo, 233253, 14)
>
> is going to cost significantly more with variable sized characters than
> fixed sized ones.
>

It all depends on what expenses you're measuring. If it's processor speed,
yes, substr() will cost more on variable-sized characters. If it's memory
usage, no: converting a UTF-8 string that has only 1 or 2
more-than-16-bits characters in it to UTF-32 will use almost 4 times the
necessary memory. And perhaps you don't even want to take a substr() of it.
Maybe you're just reading it from one file and writing it to another; why
pay the price (in both processor and memory) of converting it to and from
UTF-32 just to do that?



> No, we don't. We do *not* have to live with it at all. That UTF-8 is a
> variable-length representation is an implementation detail, and one we are
> not required to live with internally.

Which doesn't mean we're not allowed to live with it internally...

> If UTF-16 (which is also variable
> width, annoyingly) or UTF-32 (which doesn't officially exist as far as I
> can tell, but we can define by fiat)

Actually, I think it does exist; at least that's what I've read in the FAQ
on www.unicode.org. As they say, it does have a fixed character size.
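That fixed-width property is easy to check empirically: every code point,
from plain ASCII up through the supplementary planes, occupies exactly 4
bytes in UTF-32 (sample characters chosen arbitrarily for the demo):

```python
# UTF-32 is fixed-width: each code point takes exactly 4 bytes,
# regardless of which plane it lives in.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    assert len(ch.encode("utf-32-le")) == 4
print("every code point is 4 bytes in UTF-32")
```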

> is better for us, then great. They're
> all just different ways of representing Unicode abstract characters. (I
> think--I'm only up to chapter 3 of the unicode 3.0 book)
>

As UTF-8 also is.



> Besides, I think you're arguing a completely different point, and I think
> it's been missed generally. Where we're going to get bit hard, and I can't
> see a way around, is combining characters. The individual Unicode abstract
> characters can have a fixed-width representation, but the number of
> Unicode characters per 'real' character is variable, and I can't see any
> way around that. (It looks like it's legal to stack four or six modifier
> characters on a base character, and I don't think I'm willing to go so far
> as to use UTF-128 internally.

If it's abstracted by an API, I don't see any problem with having
128-bit-wide characters. They could be stored internally as 128 bits per
character, or as UTF-8 or UTF-16, or even compressed, or with other magic
stuff. Through the external API, anyone could transparently request
characters of the desired width.
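A minimal sketch of that idea (class and method names are hypothetical,
not from any proposal): the internal representation stays variable-width,
but callers only ever see whole abstract characters through the API:

```python
# Hypothetical sketch: hide the variable-width internal storage behind an
# API, so callers get full code points no matter how they're stored.
class AbstractString:
    def __init__(self, text):
        self._utf8 = text.encode("utf-8")  # internal storage: variable-width

    def char_at(self, i):
        # Decode on demand; the caller receives a complete code point
        # regardless of how many bytes it occupies internally.
        return self._utf8.decode("utf-8")[i]

s = AbstractString("a\U0001D11Eb")
print(s.char_at(1))  # the whole supplementary character, not half a pair
```

The internal encoding could be swapped for UTF-16, UTF-32, or something
compressed without changing a single caller.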



> That's a touch much, even for me... :) Then there also
> seems to be metadata embedded in the Unicode standard--stuff like the
> bidirectional ordering and alternate formatting characters. Bleah.
>
> [snip]
>
> Unicode is making my head hurt. I do *not* have an appropriate language
> background to feel comfortable with this, even for the pieces that are
> relevant to the languages I have any familiarity with.
>

I guess Unicode is too complex and a bit unstable to hardwire into our
code. When Java 1.0 was released, UTF-16 was to be the `definitive'
wide-character encoding, so Java defined its `char' type as 16 bits and its
`String' class as holding an array of `char's. Now Unicode says 16 bits
aren't enough for all applications, so they defined surrogate pairs and
claim 32 bits are all that's needed. Well, Java probably won't survive the
proliferation of surrogate pairs, because its `String' class isn't
flexible, even given that its `charAt()' method returns an integer (I
guess...).
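The surrogate-pair problem is easy to demonstrate: in UTF-16, a
supplementary character occupies two 16-bit code units, so a 16-bit `char'
can only ever hold half of it (sketched here in Python rather than Java):

```python
# A supplementary character (U+1D11E) encoded as UTF-16 becomes a
# surrogate pair: two 16-bit code units, i.e. two Java `char's.
ch = "\U0001D11E"
code_units = len(ch.encode("utf-16-le")) // 2  # each code unit is 2 bytes
print(code_units)  # 2
```

So any API built on "one `char' = one character" silently breaks as soon
as such characters appear.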

Anyway, different strings probably have different requirements. There are
probably some short ones where I need good performance and don't mind
wasting 32+ bits per character, but there are other big ones which I need
to store efficiently, and I don't care if substr() takes longer -- I need
them compact! There would probably be other cases where I need a balance of
speed and compact storage, and I'd probably go to the trouble of
implementing a variable-width indexed string approach; but if the language
doesn't expose strings through a well-defined API, I would have no way to
do that!
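One way such a "variable-width indexed string" could work (a hypothetical
design, not any existing implementation): keep the compact UTF-8 bytes,
but record a byte offset every CHUNK characters, so indexing jumps to the
nearest checkpoint and scans at most CHUNK characters instead of the whole
string:

```python
# Hypothetical indexed UTF-8 string: trade a little memory for bounded
# scan cost on character indexing.
CHUNK = 256

def build_index(utf8):
    offsets, chars, i = [0], 0, 0
    while i < len(utf8):
        b = utf8[i]  # UTF-8 lead byte determines the sequence length
        i += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        chars += 1
        if chars % CHUNK == 0:
            offsets.append(i)
    return offsets

def char_offset(utf8, offsets, n):
    # Jump to the nearest checkpoint, then scan at most CHUNK - 1 chars.
    i = offsets[n // CHUNK]
    for _ in range(n % CHUNK):
        b = utf8[i]
        i += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    return i

data = ("ab\u00e9" * 300).encode("utf-8")  # mix of 1- and 2-byte chars
idx = build_index(data)
off = char_offset(data, idx, 500)
print(data[off:off + 2].decode("utf-8")[0])  # the character at index 500
```

This is exactly the sort of alternative representation that only becomes
possible if strings are reached through a well-defined API rather than
assumed to be flat arrays.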

I'm coming up with a proposal that would build a vtable-based API for
dealing with strings in an efficient way, without losing generality. I
think I'll have it ready by next week; then I'll post it here.

- Branden
