At 05:09 PM 2/15/2001 -0800, Hong Zhang wrote:
> > ...and because of this you can't randomly access the string, you are
> > reduced to sequential access (*). And here I thought we could have
> > left tape drives to the last millennium.
> >
> > (*) Yes, of course you could cache your sequential access so you only
> > need to do it once, and build balanced trees and whatnot out of those
> > offsets to emulate random access in O(lg n) per lookup, but as soon as
> > you update the string, you have to update the tree, or whatever data
> > structure you chose. Pain, pain, pain.
>
>People in Japan/China/Korea have been using multi-byte encodings for a
>long time. I personally have used them for more than 10 years, and I have
>never felt much of the "pain". Do you think I am using my computer in
>O(n) while you are using yours in O(1)? There are 100 million people
>using variable-length encodings!!!
Not at this level they aren't. The people actually writing the code do feel
the pain, and you do pay a computational price. You can't *not* pay the price.
    substr($foo, 233253, 14)

is going to cost significantly more with variable-sized characters than
with fixed-sized ones.
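To make the cost concrete, here's a rough sketch (the sub name and the
well-formed-input assumption are mine, purely for illustration) of the scan
a substr has to do over UTF-8 bytes before it can even start copying:

    # Find the byte offset of character $n in a well-formed UTF-8
    # byte string. A fixed-width encoding does this with a multiply;
    # UTF-8 has to walk every character in front of $n.
    sub utf8_byte_offset {
        my ($bytes, $n) = @_;
        my $off = 0;
        while ($n-- > 0) {
            my $lead = ord(substr($bytes, $off, 1));
            $off += $lead < 0x80 ? 1    # one-byte sequence
                  : $lead < 0xE0 ? 2    # two-byte sequence
                  : $lead < 0xF0 ? 3    # three-byte sequence
                  :                4;   # four-byte sequence
        }
        return $off;    # O(n) in the character index
    }

With fixed-width characters that whole loop collapses to $n * $width.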
>Take this example: in Chinese every character has the same width, so
>it is very easy to format paragraphs and lines. Most English web pages
>are rendered in "Times New Roman", which is a variable-width font.
>Do you think the English pages are rendered in O(n) while Chinese pages
>are rendered in O(1)?
You need a better example, since that one's rather muddy. It's a matter of
characters per word, not pixels per character. But generally speaking,
Chinese pages will be rendered with less computational cost associated with
the layout than pages with variable-width characters.
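The difference shows up even in a toy line-filling loop (the widths below
are invented for illustration, not real font metrics):

    # Fixed-width glyphs: characters per line is one division, O(1).
    my $line_width     = 400;
    my $glyph_width    = 20;
    my $chars_per_line = int($line_width / $glyph_width);

    # Variable-width glyphs: you have to accumulate per glyph, O(n).
    my %width = (i => 8, m => 24);    # made-up advance widths
    my ($used, $count) = (0, 0);
    for my $ch (split //, "mimimimimimimimi") {
        last if $used + $width{$ch} > $line_width;
        $used += $width{$ch};
        $count++;
    }
    printf "fixed: %d chars/line, variable: %d chars fit\n",
           $chars_per_line, $count;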
>As I said, there are many problems harder than UTF-8. If you want
>to support i18n and l10n, you have to live with it.
No, we don't. We do *not* have to live with it at all. That UTF-8 is a
variable-length representation is an implementation detail, and one we are
not required to live with internally. If UTF-16 (which is also variable
width, annoyingly) or UTF-32 (which doesn't officially exist as far as I
can tell, but we can define by fiat) is better for us, then great. They're
all just different ways of representing Unicode abstract characters. (I
think--I'm only up to chapter 3 of the Unicode 3.0 book.)
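For a concrete picture of what "just different representations" means,
here's a sketch of the standard UTF-8 bit layout (the sub name is mine);
this is exactly where the variable length comes from:

    # Encode a single code point as a list of UTF-8 byte values.
    sub utf8_bytes {
        my $cp = shift;
        return ($cp) if $cp < 0x80;                             # 1 byte
        return (0xC0 |  ($cp >>  6),
                0x80 |  ($cp        & 0x3F)) if $cp < 0x800;    # 2 bytes
        return (0xE0 |  ($cp >> 12),
                0x80 | (($cp >>  6) & 0x3F),
                0x80 |  ($cp        & 0x3F)) if $cp < 0x10000;  # 3 bytes
        return (0xF0 |  ($cp >> 18),                            # 4 bytes
                0x80 | (($cp >> 12) & 0x3F),
                0x80 | (($cp >>  6) & 0x3F),
                0x80 |  ($cp        & 0x3F));
    }

    printf "U+4E2D -> %s\n",
           join " ", map { sprintf "%02X", $_ } utf8_bytes(0x4E2D);
    # prints: U+4E2D -> E4 B8 AD

The same code point is always exactly four bytes in UTF-32, which is the
whole appeal.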
Besides, I think you're arguing a completely different point, and I think
it's been missed generally. Where we're going to get bit hard, and I can't
see a way around it, is combining characters. The individual Unicode abstract
characters can have a fixed-width representation, but the number of Unicode
characters per 'real' character is variable, and I can't see any way around
that. (It looks like it's legal to stack four or six modifier characters on
a base character, and I don't think I'm willing to go so far as to use
UTF-128 internally. That's a touch much, even for me... :) Then there also
seems to be metadata embedded in the Unicode standard--stuff like the
bidirectional ordering and alternate formatting characters. Bleah.
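The combining-character problem is easy to demonstrate. Perl's \pM property
matches combining marks, so a rough (and incomplete--the real grapheme
rules are hairier) way to count 'real' characters is:

    # Two 'real' characters built from four code points.
    my $str = "a\x{303}e\x{301}";    # a + COMBINING TILDE, e + COMBINING ACUTE
    my @clusters = $str =~ /(\PM\pM*)/g;
    printf "%d code points, %d visible characters\n",
           length($str), scalar @clusters;    # 4 code points, 2 visible

No fixed-width encoding of code points makes that ratio go away.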
It looks like, for us to do Unicode properly with all the world's
languages, we might have to have a tagged text format like we've been
talking about for other things (XML and suchlike stuff). And I'm not
anywhere near sure what we should do for substitutions. If you have the
sequence:
    LATIN SMALL LETTER A, COMBINING TILDE

and do a s/a/b/, should you then have

    LATIN SMALL LETTER B, COMBINING TILDE

and if not, if you do a s/LATIN SMALL LETTER A WITH TILDE/q/ on the
sequence, should you end up with

    LATIN SMALL LETTER Q

or not? The original sequence was two separate characters and the match was
one, but they are really the same thing.
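One hedged way to frame it is in terms of normalization forms (using
Unicode::Normalize here purely as an illustration, not a design
commitment):

    use Unicode::Normalize qw(NFC);

    my $decomposed  = "a\x{303}";    # LATIN SMALL LETTER A, COMBINING TILDE
    my $precomposed = "\x{e3}";      # LATIN SMALL LETTER A WITH TILDE

    # The two spellings are canonically equivalent...
    print NFC($decomposed) eq $precomposed ? "same\n" : "different\n";

    # ...but a naive s/a/b/ sees only the base letter, leaving the
    # tilde stacked on the new 'b':
    (my $naive = $decomposed) =~ s/a/b/;

    # and matching the precomposed form against the decomposed string
    # only works if you normalize first:
    (my $fixed = NFC($decomposed)) =~ s/\x{e3}/q/;    # now just "q"

Whether the engine should normalize for you, and when, is the real
question.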
Unicode is making my head hurt. I do *not* have an appropriate language
background to feel comfortable with this, even for the pieces relevant to
the languages I am familiar with.
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk