> >People in Japan/China/Korea have been using multi-byte encodings for a
> >long time. I have personally used them for more than 10 years, and I
> >have never felt much of the "pain". Do you think I am using my computer
> >with O(n) while you are using it with O(1)? There are 100 million people
> >using variable-length encodings!!!
>
> Not at this level they aren't. The people actually writing the code do
> feel the pain, and you do pay a computational price. You can't *not* pay
> the price.
>
>    substr($foo, 233253, 14)
>
> is going to cost significantly more with variable sized characters than
> fixed sized ones.

I don't believe so. The problem is that you assume the character positions
up front: where do the values 233253 and 14 come from in the first place?
Here is an example of how to decode "Content-Length: 1000" into a
name/value pair with a multi-byte encoding. The code is in C syntax.

   #include <string.h>  /* strstr(), strlen(), strndup(), strdup() */

   const char* str  = "Content-Length: 1000";
   const char* sep  = strstr(str, ": ");        /* find the separator */
   char* name  = strndup(str, sep - str);       /* "Content-Length"   */
   char* value = strdup(sep + strlen(": "));    /* "1000"             */

If you go through the C string functions, plus the XXXprintf() family,
most of them, if not all, are O(n) anyway.
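
Even if you do want a substr() by character index, the extra cost on UTF-8
is just one more linear scan of the same kind. Here is a minimal sketch,
assuming well-formed UTF-8 (utf8_skip is a hypothetical helper, not a
standard function):

   #include <stddef.h>

   /* Advance past n UTF-8 characters: a single O(n) byte scan, the same
      kind of pass strstr() or a printf() call already makes over the
      string. Assumes well-formed input. */
   static const char* utf8_skip(const char* s, size_t n)
   {
       for (; n > 0 && *s != '\0'; n--) {
           s++;                                     /* lead byte          */
           while (((unsigned char)*s & 0xC0) == 0x80)
               s++;                                 /* continuation bytes */
       }
       return s;
   }

With that, substr($foo, 233253, 14) becomes two such scans instead of one
pointer addition; the asymptotic cost of the surrounding code does not
change.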

> >Take this example: in Chinese every character has the same width, so
> >it is very easy to format paragraphs and lines. Most English web pages
> >are rendered using "Times New Roman", which is a variable-width font.
> >Do you think the English pages are rendered in O(n) while Chinese pages
> >are rendered in O(1)?
>
> You need a better example, since that one's rather muddy.

The example is not a good one. How about finding the cursor position when
you click in the middle of a Word document? A fixed-width font will be
faster than a variable-width one, right?
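
To make the analogy concrete, here is a minimal sketch (the widths[] table
and the function names are hypothetical, purely for illustration): the
fixed-width hit test is a single division, while the variable-width one
has to walk the glyph widths.

   #include <stddef.h>

   /* Fixed-width font: the column under a click is one division, O(1). */
   size_t column_fixed(int click_x, int char_width)
   {
       return (size_t)(click_x / char_width);
   }

   /* Variable-width font: sum per-glyph widths until the click point is
      reached, an O(n) walk over the line. */
   size_t column_variable(int click_x, const int* widths, size_t len)
   {
       int x = 0;
       size_t i;
       for (i = 0; i < len && x + widths[i] <= click_x; i++)
           x += widths[i];
       return i;
   }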

> >As I said, there are many problems harder than UTF-8. If you want
> >to support i18n and l10n, you have to live with it.
>
> No, we don't. We do *not* have to live with it at all. That UTF-8 is a
> variable-length representation is an implementation detail, and one we are
> not required to live with internally. If UTF-16 (which is also variable
> width, annoyingly) or UTF-32 (which doesn't officially exist as far as I
> can tell, but we can define by fiat) is better for us, then great. They're
> all just different ways of representing Unicode abstract characters. (I
> think--I'm only up to chapter 3 of the unicode 3.0 book)
>
> Besides, I think you're arguing a completely different point, and I think
> it's been missed generally. Where we're going to get bit hard, and I can't
> see a way around, is combining characters.

My original argument is to use UTF-8 as the internal representation of
strings. Given the complexity of i18n and l10n, most text-processing jobs
can be done as efficiently with UTF-8 as with UTF-32, unless you want to
treat the strings as binary. Most text processing uses linear algorithms
anyway.
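
As a small illustration of that last point, here is a sketch of counting
code points under both encodings (assuming well-formed input): each is a
single linear pass, only the width of the unit differs.

   #include <stddef.h>
   #include <stdint.h>

   /* Count code points in a NUL-terminated UTF-8 string: one linear pass,
      counting only lead bytes. */
   size_t utf8_length(const char* s)
   {
       size_t n = 0;
       for (; *s != '\0'; s++)
           if (((unsigned char)*s & 0xC0) != 0x80)
               n++;
       return n;
   }

   /* The UTF-32 version is the same linear pass over wider units. */
   size_t utf32_length(const uint32_t* s)
   {
       size_t n = 0;
       while (*s++ != 0)
           n++;
       return n;
   }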

Hong

