At 05:09 PM 2/15/2001 -0800, Hong Zhang wrote:
> > ...and because of this you can't randomly access the string, you are
> > reduced to sequential access (*). And here I thought we could have
> > left tape drives to the last millennium.
> >
> > (*) Yes, of course you could cache your sequential access so you only
> > need to do it once, and build balanced trees and whatnot out of those
> > offsets to emulate random access in O(lg n) per lookup, but as soon as
> > you update the string, you have to update the tree, or whatever data
> > structure you chose. Pain, pain, pain.
>
>People in Japan/China/Korea have been using multi-byte encodings for a
>long time. I personally have used them for more than 10 years, and I have
>never felt much of the "pain". Do you think I am using my computer in
>O(n) while you are using yours in O(1)? There are 100 million people
>using variable-length encodings!!!
Not at this level they aren't. The people actually writing the code do feel
the pain, and you do pay a computational price. You can't *not* pay the price.
    substr($foo, 233253, 14)

is going to cost significantly more with variable-sized characters than
with fixed-sized ones.
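To make the cost concrete, here's a rough sketch (the sub name and the
well-formed-input assumption are mine, purely for illustration) of the scan
a substr has to do over UTF-8 bytes before it can even start copying:

    # Find the byte offset of character $n in a well-formed UTF-8
    # byte string. A fixed-width encoding does this with a multiply;
    # UTF-8 has to walk every character in front of $n.
    sub utf8_byte_offset {
        my ($bytes, $n) = @_;
        my $off = 0;
        while ($n-- > 0) {
            my $lead = ord(substr($bytes, $off, 1));
            $off += $lead < 0x80 ? 1    # one-byte sequence
                  : $lead < 0xE0 ? 2    # two-byte sequence
                  : $lead < 0xF0 ? 3    # three-byte sequence
                  :                4;   # four-byte sequence
        }
        return $off;    # O(n) in the character index
    }

With fixed-width characters that whole loop collapses to $n * $width.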
>Take this example: in Chinese every character has the same width, so
>it is very easy to format paragraphs and lines. Most English web pages
>are rendered in "Times New Roman", which is a variable-width font.
>Do you think the English pages are rendered in O(n) while Chinese pages
>are rendered in O(1)?
You need a better example, since that one's rather muddy. It's a matter of
characters per word, not pixels per character. But generally speaking,
Chinese pages will be rendered with less computational cost associated with
the layout than pages with variable-width characters.
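The difference shows up even in a toy line-filling loop (the widths below
are invented for illustration, not real font metrics):

    # Fixed-width glyphs: characters per line is one division, O(1).
    my $line_width     = 400;
    my $glyph_width    = 20;
    my $chars_per_line = int($line_width / $glyph_width);

    # Variable-width glyphs: you have to accumulate per glyph, O(n).
    my %width = (i => 8, m => 24);    # made-up advance widths
    my ($used, $count) = (0, 0);
    for my $ch (split //, "mimimimimimimimi") {
        last if $used + $width{$ch} > $line_width;
        $used += $width{$ch};
        $count++;
    }
    printf "fixed: %d chars/line, variable: %d chars fit\n",
           $chars_per_line, $count;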
>As I said, there are many problems harder than UTF-8. If you want
>to support i18n and l10n, you have to live with it.
No, we don't. We do *not* have to live with it at all. That UTF-8 is a
variable-length representation is an implementation detail, and one we are
not required to live with internally. If UTF-16 (which is also variable
width, annoyingly) or UTF-32 (which doesn't officially exist as far as I
can tell, but we can define by fiat) is better for us, then great. They're
all just different ways of representing Unicode abstract characters. (I
think--I'm only up to chapter 3 of the Unicode 3.0 book.)
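For a concrete picture of what "just different representations" means,
here's a sketch of the standard UTF-8 bit layout (the sub name is mine);
this is exactly where the variable length comes from:

    # Encode a single code point as a list of UTF-8 byte values.
    sub utf8_bytes {
        my $cp = shift;
        return ($cp) if $cp < 0x80;                             # 1 byte
        return (0xC0 |  ($cp >>  6),
                0x80 |  ($cp        & 0x3F)) if $cp < 0x800;    # 2 bytes
        return (0xE0 |  ($cp >> 12),
                0x80 | (($cp >>  6) & 0x3F),
                0x80 |  ($cp        & 0x3F)) if $cp < 0x10000;  # 3 bytes
        return (0xF0 |  ($cp >> 18),                            # 4 bytes
                0x80 | (($cp >> 12) & 0x3F),
                0x80 | (($cp >>  6) & 0x3F),
                0x80 |  ($cp        & 0x3F));
    }

    printf "U+4E2D -> %s\n",
           join " ", map { sprintf "%02X", $_ } utf8_bytes(0x4E2D);
    # prints: U+4E2D -> E4 B8 AD

The same code point is always exactly four bytes in UTF-32, which is the
whole appeal.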
Besides, I think you're arguing a completely different point, and I think
it's been missed generally. Where we're going to get bit hard, and I can't
see a way around it, is combining characters. The individual Unicode abstract
characters can have a fixed-width representation, but the number of Unicode
characters per 'real' character is variable, and I can't see any way around
that. (It looks like it's legal to stack four or six modifier characters on
a base character, and I don't think I'm willing to go so far as to use
UTF-128 internally. That's a touch much, even for me... :) Then there also
seems to be metadata embedded in the Unicode standard--stuff like the
bidirectional ordering and alternate formatting characters. Bleah.
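The combining-character problem is easy to demonstrate. Perl's \pM property
matches combining marks, so a rough (and incomplete--the real grapheme
rules are hairier) way to count 'real' characters is:

    # Two 'real' characters built from four code points.
    my $str = "a\x{303}e\x{301}";    # a + COMBINING TILDE, e + COMBINING ACUTE
    my @clusters = $str =~ /(\PM\pM*)/g;
    printf "%d code points, %d visible characters\n",
           length($str), scalar @clusters;    # 4 code points, 2 visible

No fixed-width encoding of code points makes that ratio go away.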
It looks like, for us to do Unicode properly with all the world's
languages, we might have to have a tagged text format like we've been
talking about for other things (XML and suchlike stuff). And I'm not
anywhere near sure what we should do for substitutions. If you have the
sequence:
    LATIN SMALL LETTER A, COMBINING TILDE

and do a s/a/b/, should you then have

    LATIN SMALL LETTER B, COMBINING TILDE

and if not, if you do a s/LATIN SMALL LETTER A WITH TILDE/q/ on the
sequence, should you end up with

    LATIN SMALL LETTER Q

or not? The original sequence was two separate characters and the match was
one, but they are really the same thing.
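One hedged way to frame it is in terms of normalization forms (using
Unicode::Normalize here purely as an illustration, not a design
commitment):

    use Unicode::Normalize qw(NFC);

    my $decomposed  = "a\x{303}";    # LATIN SMALL LETTER A, COMBINING TILDE
    my $precomposed = "\x{e3}";      # LATIN SMALL LETTER A WITH TILDE

    # The two spellings are canonically equivalent...
    print NFC($decomposed) eq $precomposed ? "same\n" : "different\n";

    # ...but a naive s/a/b/ sees only the base letter, leaving the
    # tilde stacked on the new 'b':
    (my $naive = $decomposed) =~ s/a/b/;

    # and matching the precomposed form against the decomposed string
    # only works if you normalize first:
    (my $fixed = NFC($decomposed)) =~ s/\x{e3}/q/;    # now just "q"

Whether the engine should normalize for you, and when, is the real
question.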
Unicode is making my head hurt. I do *not* have an appropriate language
background to feel comfortable with this, even for the pieces relevant to
the languages I am familiar with.
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk