Moved to -unicode, because that's what it's *for*.
On Fri, Feb 16, 2001 at 01:17:03PM -0800, Hong Zhang wrote:
> > substr's already been mentioned.
>
> I have already given the counter argument. The codepoint position is useless
> in many cases. They should be deprecated.
Uh? That doesn't make sense. Codepoint position is *exactly* what people
expect when they use substr. When I say
$a = substr($b,10);
I want the substring starting at character offset 10. If I get the substring
starting at byte offset 10 instead, and we're using UTF-8 as you suggest, I
might be cutting into the middle of a character, leaving the resulting string
malformed. That's horrific.
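To make the failure concrete, here is a minimal sketch in C of what "landing in the middle of a character" means at the byte level. The helper name utf8_is_boundary is made up for illustration (it is not a perl function), and it assumes the buffer holds well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if byte offset `off` falls on the start of a UTF-8 character
 * (or at the very end of the buffer), 0 if it lands on a continuation byte.
 * Hypothetical helper for illustration; assumes `s` is well-formed UTF-8. */
int utf8_is_boundary(const unsigned char *s, size_t len, size_t off)
{
    assert(s != NULL && off <= len);
    if (off == len)
        return 1;
    /* Continuation bytes always look like 10xxxxxx. */
    return (s[off] & 0xC0) != 0x80;
}
```

For the six bytes of "caf\xC3\xA9s", offsets 3 and 5 are boundaries, but offset 4 lands inside the two-byte character, so a byte-based substr starting there would hand back a malformed string.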
> I designed and implemented most of Java regular expression. I don't feel
> significant different between UTF-8 and UTF-16. Under some cases UTF-8
> are much better.
Try telling Jarkko. :)
> I agree the constant will be higher and code will not be easy. But I don't
> believe that will be a significant problem. It is just a small problem
> of dealing with Unicode.
No problem vs. problem.
I know which I'd choose.
> My understand the chop() can be very efficient under common cases, "abc\n",
> for both UTF-8 and UTF-32.
What about in the case of "abc\x{1F1E}"? UTF-32 or UTF-16 here is *vastly*
more efficient than UTF-8.
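As a rough sketch of the difference, assuming well-formed input and using function names that are mine rather than anything in perl's source: the UTF-8 version has to look at bytes to find where the last character starts, while the fixed-width version only subtracts.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of `s` after dropping its last UTF-8 character.
 * Hypothetical helper; assumes `s` is well-formed UTF-8. */
size_t utf8_chop_len(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    if (len == 0)
        return 0;
    len--;                                  /* step onto the final byte */
    while (len > 0 && (s[len] & 0xC0) == 0x80)
        len--;                              /* back over 10xxxxxx bytes */
    return len;
}

/* Same operation on a buffer of fixed-width units, counted in units
 * (for example 32-bit code points): no inspection of the data at all. */
size_t fixed_chop_len(size_t nunits)
{
    return nunits > 0 ? nunits - 1 : 0;
}
```

With "abc\n" the loop body never runs, which is the common case mentioned above. With "abc\x{1F1E}", encoded in UTF-8 as the three bytes E1 BC 9E, it has to back up over two continuation bytes before it knows where to cut.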
> The s/.// case is misleading too. If you define . as [^\n], the UTF-8 and
> UTF-32 will have exactly the same performance
No, no, no, no, no, no, no.
UTF-16 case: Remove first two bytes
UTF-8 case: Examine first byte, determine character width w, remove w bytes.
Now do that n times, and tell me which is more efficient.
This is not "exactly the same", by any stretch of the imagination.
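The two cases can be sketched in C as follows. This is an illustration under stated assumptions, not perl's actual implementation: the function names are hypothetical, the UTF-8 input is assumed well-formed, and the 2-byte version assumes every character fits in one 16-bit unit (no surrogate pairs).

```c
#include <assert.h>
#include <stddef.h>

/* Width in bytes of the UTF-8 character whose lead byte is `b`.
 * Stand-in for the "determine character width" step; assumes `b`
 * is a valid lead byte. */
size_t utf8_width(unsigned char b)
{
    if (b < 0x80) return 1;             /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    return 4;                           /* 11110xxx */
}

/* Byte offset after skipping n characters: one width lookup per character. */
size_t utf8_skip(const unsigned char *s, size_t len, size_t n)
{
    size_t off = 0;
    assert(s != NULL);
    while (n-- > 0 && off < len)
        off += utf8_width(s[off]);
    return off < len ? off : len;       /* clamp to the end of the buffer */
}

/* Same skip with fixed 2-byte units: one multiplication, no lookups. */
size_t utf16_skip(size_t len, size_t n)
{
    size_t off = n * 2;
    return off < len ? off : len;
}
```

The UTF-8 loop touches the string once per character skipped; the fixed-width version never reads the string at all.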
> Another example is m/S/i. The Unicode case mapping is one-to-many and
> many-to-one, especially considering locale. Neither UTF-8 or UTF-32
> will save you.
That's irrelevant. The efficiency-significant part is skipping through the
string, and knowing *exactly* how far you need to skip ahead is much more
efficient than having to stop and recalculate it for each character.
I really cannot see how to express this any more simply or any more
persuasively.
To skip ahead n characters:
UTF16 : s += 2 * n;                     : O(1) : Good
UTF8  : while (n--) s += UTF8WIDTH(*s); : O(n) : Bad
--
Going to church does not make a person religious, nor does going to school
make a person educated, any more than going to a garage makes a person a car.