> substr's already been mentioned.

I have already given the counterargument: the code point position is
useless in many cases. Code point positions should be deprecated.
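
To make that concrete, here is a small Java illustration of mine (not
anything from the earlier mail): a code point index can happily split a
user-perceived character.

    // "e" + U+0301 COMBINING ACUTE ACCENT is one visible character but
    // two code points; a code-point-based substr cuts the accent off.
    String s = "e\u0301";                  // displays as a single e-acute
    System.out.println(s.substring(0, 1)); // prints bare "e", accent lost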

> Regular expressions. Perl does rather a lot of them. We've already found
> from Perl 5 development that they get nasty when variable length data is
> involved.

I designed and implemented most of the Java regular expression engine.
I don't see a significant difference between UTF-8 and UTF-16; in some
cases UTF-8 is much better.

> And it's not so much that you get O(1) access, but also the fact that the
> constant is lower. chop(), s/.// and the like are much more efficient
> *and* much easier to code if you know how many bytes you're taking off
> beforehand.

I agree the constant will be higher and the code will not be as easy.
But I don't believe that will be a significant problem; it is just a
small part of the overall cost of dealing with Unicode.

My understanding is that chop() can be very efficient in the common
cases, such as "abc\n", for both UTF-8 and UTF-32. It does not work at
all in the face of "abc\r\n", surrogates, combining characters, Hangul
conjoining jamo, etc.
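
Here is a rough Java sketch of mine of the UTF-8 half of that claim,
working on a raw byte buffer. UTF-8 continuation bytes all match
10xxxxxx, so the backward scan is bounded and constant time:

    // Drop the last code point from a UTF-8 buffer of length len.
    // We back up over at most a few continuation bytes -- O(1), just
    // like UTF-32. Note it removes a code *point* only: "abc\r\n",
    // combining characters, etc. defeat it exactly as described above,
    // and UTF-32 fares no better there.
    static int chopUtf8(byte[] s, int len) {
        int i = len - 1;
        while (i > 0 && (s[i] & 0xC0) == 0x80)
            i--;                  // skip over continuation bytes
        return i < 0 ? 0 : i;     // new length without the last code point
    }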

The s/.// case is misleading too. If you define . as [^\n], UTF-8 and
UTF-32 will have exactly the same performance, and UTF-8 will probably
have much better cache locality. If you define . as
[^\u000A\u000B\u000C\u000D\u0085\u2028\u2029], as recommended by
Unicode, neither will be very efficient.
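
For the first definition, stripping the leading character from UTF-8 is
one branch on the lead byte, no table needed. A sketch of mine, assuming
well-formed UTF-8 input:

    // Byte length of the first code point, read off the lead byte
    // alone; s/.// just advances the start offset by this much.
    // Constant time, no slower than UTF-32.
    static int leadLen(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;  // 0xxxxxxx
        if (b < 0xE0) return 2;  // 110xxxxx
        if (b < 0xF0) return 3;  // 1110xxxx
        return 4;                // 11110xxx
    }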

Another example is m/S/i. Unicode case mapping is one-to-many and
many-to-one, especially once locale is considered. Neither UTF-8 nor
UTF-32 will save you.
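
Java's standard library shows both problems directly, independent of
the internal encoding:

    import java.util.Locale;

    public class CaseDemo {
        public static void main(String[] args) {
            // One-to-many: U+00DF LATIN SMALL LETTER SHARP S uppercases
            // to the two-letter "SS", so the result is longer than the
            // input in *any* encoding.
            System.out.println("stra\u00DFe".toUpperCase(Locale.ROOT)); // STRASSE

            // Locale-sensitive: Turkish 'i' uppercases to U+0130, LATIN
            // CAPITAL LETTER I WITH DOT ABOVE, not ASCII 'I'.
            System.out.println("i".toUpperCase(new Locale("tr")));
        }
    }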

My experience tells me that UTF-8 plus byte indices (or a character
iterator in a pure OO language) is the best trade-off, because it is
compact and is the default encoding of XML.
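
As a sketch of what I mean by a character iterator over byte indices
(my own illustration, again assuming well-formed UTF-8):

    // Walk a UTF-8 buffer, yielding (byte offset, code point) pairs.
    // Saved positions are plain byte offsets, so seeking back to one is
    // O(1), and forward iteration is O(n) -- the same as UTF-32, at a
    // fraction of the memory for typical text.
    static void forEachCodePoint(byte[] s) {
        for (int i = 0; i < s.length; ) {
            int b = s[i] & 0xFF, n, cp;
            if (b < 0x80)      { n = 1; cp = b; }
            else if (b < 0xE0) { n = 2; cp = b & 0x1F; }
            else if (b < 0xF0) { n = 3; cp = b & 0x0F; }
            else               { n = 4; cp = b & 0x07; }
            for (int k = 1; k < n; k++)
                cp = (cp << 6) | (s[i + k] & 0x3F);
            System.out.printf("byte %d: U+%04X%n", i, cp);
            i += n;
        }
    }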

Hong
