At 12:20 PM 2/16/2001 -0800, Hong Zhang wrote:
> > >People in Japan/China/Korea have been using multi-byte encodings for a
> > >long time. I personally have used them for more than 10 years, and I have
> > >never felt much of the "pain". Do you think I am using my computer with
> > >O(n) while you are using it with O(1)? There are 100 million people using
> > >variable-length encodings!!!
> >
> > Not at this level they aren't. The people actually writing the code do
> > feel the pain, and you do pay a computational price. You can't *not* pay
> > the price.
> >
> > substr($foo, 233253, 14)
> >
> > is going to cost significantly more with variable sized characters than
> > fixed sized ones.
>
>I don't believe so.
Then you would be incorrect. To find the character at position 233253 in a
variable-length encoding requires scanning the string from the beginning,
and has a rather significant potential cost. You've got a test for every
character up to that point with a potential branch or two on each one.
You're guaranteed to blow the heck out of your processor's D-cache, since
you've just waded through between 200 and 800K of data that's essentially
meaningless for the operation in question.
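To make that cost concrete, here's a minimal C sketch (the function name and layout are mine, not Perl's internals) of what substr() has to do under a variable-length encoding: to find the byte offset of character n in a UTF-8 buffer, every byte up to it has to be examined and tested, since continuation bytes (the 10xxxxxx ones) don't start a character.

```c
#include <stddef.h>

/* Walk a UTF-8 buffer to find the byte offset of the n-th character.
 * Every byte up to that point gets loaded and tested -- this is the
 * D-cache-thrashing scan described above.  Assumes well-formed UTF-8. */
static size_t utf8_char_offset(const char *s, size_t len, size_t n)
{
    size_t byte = 0;
    while (n > 0 && byte < len) {
        byte++;                                /* consume the lead byte */
        while (byte < len && ((unsigned char)s[byte] & 0xC0) == 0x80)
            byte++;                            /* skip continuation bytes */
        n--;
    }
    return byte;   /* O(n) in characters; fixed width is just n * width */
}
```

With a fixed-width representation like UTF-32 the same lookup is a single multiply, no memory traffic at all.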
>The problem is you assume the character position at
>the very beginning.
Well, the problem is that I'm assuming characters period. I have this
nagging feeling that primary assumption is flawed.
>Where did you get the values of 233253 and 14?
Honestly? I picked them out of the air. They're there to demonstrate a
point. They have no intrinsic meaning.
>Hereby I will show an example of how to decode "Content-Length: 1000" into
>a name/value pair using a multi-byte encoding. The code is in C syntax.
While interesting, I don't see how that's relevant here. Perhaps I'm
missing something.
>If you go through the C string functions plus XXXprintf(), most of them,
>if not all, are O(n).
C's string functions suck even worse than its stdio system. Perl doesn't
use them, and generally can't.
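The core problem with C's string functions is that a C string carries no length, so strlen() and friends rescan to the NUL on every call. A counted-string layout -- the general idea behind perl's internal string values, sketched here with made-up names -- pays for the length walk once and answers length queries in O(1):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted-string type: the buffer plus its length.
 * Not Perl's actual SV layout, just the shape of the idea. */
typedef struct {
    char  *buf;
    size_t len;            /* bytes currently stored in buf */
} counted_str;

static counted_str cs_from(const char *s)
{
    counted_str c;
    c.len = strlen(s);     /* the one O(n) walk, at construction */
    c.buf = malloc(c.len + 1);
    memcpy(c.buf, s, c.len + 1);
    return c;
}

static size_t cs_length(const counted_str *c)
{
    return c->len;         /* no rescan: O(1) */
}
```

Appending is where the difference really bites: repeated strcat() over a plain C string is O(n^2) because each call rescans from the front, while a counted string appends at buf + len directly.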
> > >Take this example: in Chinese every character has the same width, so
> > >it is very easy to format paragraphs and lines. Most English web pages
> > >are rendered using "Times New Roman", which is a variable-width font.
> > >Do you think the English pages are rendered O(n) while Chinese pages
> > >are rendered O(1)?
> >
> > You need a better example, since that one's rather muddy.
>
>The example is not good. How about finding the cursor position when you
>click in the middle of a Word document? A fixed-width font will be faster
>than a variable-width one. Right?
No, but that's because Word (and word processors in general) has all sorts
of stuff tagged to each character. Translating from screen position to
character depends very little on the actual font you're using--it's close
to lost in the noise of other things.
> > >As I said there are many more hard problems than UTF-8. If you want
> > >to support i18n and l10n, you have to live with it.
> >
> > No, we don't. We do *not* have to live with it at all. That UTF-8 is a
> > variable-length representation is an implementation detail, and one we are
> > not required to live with internally. If UTF-16 (which is also variable
> > width, annoyingly) or UTF-32 (which doesn't officially exist as far as I
> > can tell, but we can define by fiat) is better for us, then great. They're
> > all just different ways of representing Unicode abstract characters. (I
> > think--I'm only up to chapter 3 of the unicode 3.0 book)
> >
> > Besides, I think you're arguing a completely different point, and I think
> > it's been missed generally. Where we're going to get bit hard, and I can't
> > see a way around, is combining characters.
>
>My original argument is to use UTF-8 as the internal representation of
>strings. Given the complexity of i18n and l10n, most text processing jobs
>can be done as efficiently using UTF-8 as using UTF-32, unless you want to
>treat them as binary. Most text processing uses linear algorithms anyway.
Most, but not all. I'm sort of required to deal with all of them, as well
as the resulting complexity of the code needed to deal with it. The fact
that perl 5's regex engine turned from an evil, festering, mind-warping
mound of code into an evil, festering, mind-warping mound of code that
attacks small children with the switch to variable-length encodings is a
good enough reason not to use one.
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk