> Then you would be incorrect. To find the character at position 233253 in a
> variable-length encoding requires scanning the string from the beginning,
> and has a rather significant potential cost. You've got a test for every
> character up to that point with a potential branch or two on each one.
> You're guaranteed to blow the heck out of your processor's D-cache, since
> you've just waded through between 200 and 800K of data that's essentially
> meaningless for the operation in question.

The concept of "character position" does not exist in many languages,
and it is not needed in many cases.
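To make the cost concrete, here is a minimal sketch (not from the original thread) of what indexing into a UTF-8 string looks like: every byte up to the target must be examined, which is the O(n) scan the quoted text objects to, but a purely linear traversal pays no such penalty.

```c
#include <stddef.h>

/* Return a pointer to the n-th character (0-based) of a NUL-terminated
 * UTF-8 string, or NULL if the string has fewer characters than that.
 * Continuation bytes have the form 10xxxxxx, i.e. (byte & 0xC0) == 0x80,
 * so counting characters means skipping them. */
const char *utf8_index(const char *s, size_t n)
{
    while (*s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {  /* start of a character */
            if (n == 0)
                return s;
            n--;
        }
        s++;
    }
    return NULL;  /* string ended before character n */
}
```

The same loop, run once over the whole string, visits each byte exactly once, which is why algorithms that already walk the string linearly lose nothing to the variable-length encoding.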

> Honestly? I picked them out of the air. They're there to demonstrate a
> point. They have no intrinsic meaning.
>
> >Hereby
> >I will show an example of how to decode "Content-Length: 1000" into a
> >name/value pair using multi-byte encoding. The code is in C syntax.
>
> While interesting, I don't see how that's relevant here. Perhaps I'm
> missing something.

My point is that multi-byte code can be as efficient as single-byte code.
Your "out of the air" example does not represent a practical case;
I would welcome a more realistic one.
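The original decoding example is not reproduced here, but a sketch of the idea might look like the following: because every byte of a multi-byte UTF-8 sequence has its high bit set, an ASCII delimiter such as ':' can never occur inside one, so a byte-oriented scan is exactly as cheap for UTF-8 input as for single-byte input.

```c
#include <string.h>

/* Split a "Name: value" header line in place into a name and a value.
 * The scan is purely byte-oriented: ':' (0x3A) cannot appear inside a
 * multi-byte UTF-8 sequence, so no decoding is needed at all.
 * Returns 1 on success, 0 if no colon was found. */
int split_header(char *line, char **name, char **value)
{
    char *colon = strchr(line, ':');
    if (!colon)
        return 0;
    *colon = '\0';
    *name = line;
    *value = colon + 1;
    while (**value == ' ')  /* skip optional whitespace after the colon */
        (*value)++;
    return 1;
}
```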

> C's string functions suck even worse than its stdio system. Perl doesn't
> use them, and generally can't.

I don't believe so. Many of C's string functions are intrinsic to the
compiler and operating system, and they are coded in extremely fancy
assembly; see the source of the string functions in Microsoft Visual C++.
I have seen one engineer work on memcpy() for the Sun UltraSPARC V
for quite some time. It is non-trivial to replicate that work, and I
don't believe the Perl community has done so.

> No, but that's because Word (and word processors in general) has all sorts
> of stuff tagged to each character. Translating from screen position to
> character depends very little on the actual font you're using--it's close
> to lost in the noise of other things.

I just used it as a metaphor. Just as you implied, a lot can be done
to improve the performance of UTF-8 too.

> >My original argument is to use UTF-8 as the internal representation of
> >strings. Given the complexity of i18n and l10n, most text-processing
> >jobs can be done as efficiently using UTF-8 as using UTF-32, unless you
> >want to treat them as binary. Most text processing uses linear
> >algorithms anyway.
>
> Most, but not all. I'm sort of required to deal with all of them, as well
> as the resulting complexity of the code needed to deal with it. The fact
> that perl 5's regex engine turned from an evil, festering, mind-warping
> mound of code to an evil, festering, mind-warping mound of code that
> attacks small children with the switch to dealing with variable-length
> encoding's a good enough reason to not use it.

I don't believe the current Perl regex engine is in the best form it can
be. I have designed and implemented a full-blown regex engine using
UTF-8, and it was not that complicated.
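That engine is not shown in the thread, but the key observation can be sketched: a byte-oriented regex engine matches literal bytes and alternations over UTF-8 unchanged, and only needs to know about the encoding in the few places that must advance by one character, such as matching '.'.

```c
/* Advance p past exactly one UTF-8 character: skip the lead byte, then
 * any continuation bytes (10xxxxxx).  In a byte-oriented regex engine
 * this is essentially the only encoding-aware primitive required;
 * literal matching works on raw bytes because UTF-8 is
 * self-synchronizing. */
const char *utf8_next(const char *p)
{
    p++;
    while (((unsigned char)*p & 0xC0) == 0x80)
        p++;
    return p;
}
```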

Hong
