Re: string encoding

Simon Cozens Fri, 16 Feb 2001 16:16:56 -0800
Moved to -unicode, because that's what it's *for*.

On Fri, Feb 16, 2001 at 01:17:03PM -0800, Hong Zhang wrote:
> > substr's already been mentioned.
> 
> I have already given the counter argument. The codepoint position is useless
> in many cases. They should be deprecated.

Uh? That doesn't make sense. Codepoint position is *exactly* what people
expect when they use substr. When I say 

    $a = substr($b,10);

I want the string starting at character offset 10. If I get the string
starting at byte offset 10, and we're using UTF-8 as you suggest, I might
be cutting into the middle of a character, leaving the resulting string
malformed. That's horrific.
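
To make the byte-versus-character distinction concrete, here is a minimal
C sketch, assuming the buffer holds valid UTF-8. The function names are
made up for illustration and are not taken from the Perl source. It tests
whether a byte offset falls on a character boundary, and converts a
character offset into the byte offset a character-based substr would need.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Return 1 if cutting s at byte offset off lands between two characters,
 * 0 if it lands inside a multi-byte sequence or past the end.
 * Hypothetical helper; assumes s holds len bytes of valid UTF-8. */
int utf8_is_boundary(const char *s, size_t len, size_t off)
{
    if (off > len)
        return 0;
    if (off == len)
        return 1;
    return !utf8_is_continuation((unsigned char)s[off]);
}

/* Convert a character offset into a byte offset by walking the string.
 * Returns len if the string has fewer than nchars characters. */
size_t utf8_char_to_byte_offset(const char *s, size_t len, size_t nchars)
{
    size_t i = 0;
    while (i < len && nchars > 0) {
        i++;                                   /* step past the lead byte */
        while (i < len && utf8_is_continuation((unsigned char)s[i]))
            i++;                               /* and its continuation bytes */
        nchars--;
    }
    return i;
}
```

For the four-byte string "a\xC3\xA9z" (a, e-acute, z), byte offset 2 is
not a boundary, and character offset 2 maps to byte offset 3.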

> I designed and implemented most of Java regular expression. I don't feel
> significant different between UTF-8 and UTF-16. Under some cases UTF-8
> are much better.

Try telling Jarkko. :)

> I agree the constant will be higher and code will not be easy. But I don't
> believe that will be a significant problem. It is just a small problem
> of dealing with Unicode.

No problem vs. problem.

I know which I'd choose.

> My understand the chop() can be very efficient under common cases, "abc\n",
> for both UTF-8 and UTF-32. 

What about in the case of "abc\x{1F1E}"? UTF-32 or UTF-16 here is *vastly*
more efficient than UTF-8.
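
A rough C sketch of the two chop cases, assuming valid UTF-8 input and
using hypothetical function names. The UTF-8 version has to step back over
continuation bytes until it reaches a lead byte (at most three steps for
well-formed input), while the fixed-width version is a single subtraction.

```c
#include <stddef.h>

/* Byte length of s once its last character is removed.
 * Hypothetical helper; assumes s holds len bytes of valid UTF-8. */
size_t utf8_chop_len(const char *s, size_t len)
{
    if (len == 0)
        return 0;
    size_t i = len - 1;
    /* Step back over continuation bytes (10xxxxxx) to reach the lead byte. */
    while (i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80)
        i--;
    return i;
}

/* The same operation on a UTF-32 buffer, measured in 32-bit units:
 * every character is exactly one unit, so no scanning is needed. */
size_t utf32_chop_len(size_t nunits)
{
    return nunits > 0 ? nunits - 1 : 0;
}
```

With "abc\n" the UTF-8 loop body never runs. With "abc\x{1F1E}", which is
"abc" followed by the bytes E1 BC 9E in UTF-8, it runs twice before
finding the lead byte.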

> The s/.// case is misleading too. If you define . as [^\n], the UTF-8 and
> UTF-32 will have exactly the same performance

No, no, no, no, no, no, no.

UTF-16 case: Remove first two bytes
UTF-8  case: Examine first byte, determine character width, remove n bytes.

Now do that n times, and tell me which is more efficient.

This is not "exactly the same", by any stretch of the imagination.

> Another example is m/S/i. The Unicode case mapping is one-to-many and
> many-to-one, especially considering locale. Neither UTF-8 or UTF-32
> will save you.

That's irrelevant. The efficiency-significant part is skipping through the
string, and knowing *exactly* how far you need to skip ahead is much more
efficient than having to stop and recalculate it for each character.

I really don't know how to express this any more simply or any more
persuasively.

UTF16 : s += 2;            : O(1) : Good
UTF8  : s += UTF8WIDTH(*s) : O(n) : Bad
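
Spelled out as a C sketch, under stated assumptions: utf8_width is a
hypothetical stand-in for UTF8WIDTH, the input is well-formed UTF-8, and
the 16-bit case treats every character as exactly two bytes (that is, no
surrogate pairs). Skipping n characters is a loop in UTF-8 and a single
multiplication in the fixed-width case.

```c
#include <stddef.h>

/* Width in bytes of the UTF-8 sequence that starts with lead byte b.
 * Returns 1 for bytes that are not valid lead bytes, so a scan over
 * bad input still advances. Hypothetical stand-in for UTF8WIDTH. */
size_t utf8_width(unsigned char b)
{
    if (b < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((b & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((b & 0xF8) == 0xF0)
        return 4;               /* 11110xxx */
    return 1;
}

/* Byte offset reached after skipping nchars characters: one width
 * lookup per character, so the cost grows with nchars. */
size_t utf8_skip(const char *s, size_t len, size_t nchars)
{
    size_t i = 0;
    while (nchars > 0 && i < len) {
        i += utf8_width((unsigned char)s[i]);
        nchars--;
    }
    return i < len ? i : len;
}

/* The same skip when every character is exactly two bytes
 * (assumption: no surrogate pairs): constant time. */
size_t fixed16_skip(size_t len_bytes, size_t nchars)
{
    size_t off = nchars * 2;
    return off < len_bytes ? off : len_bytes;
}
```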

-- 
Going to church does not make a person religious, nor does going to school
make a person educated, any more than going to a garage makes a person a car.