Re: Bit ops on strings

Jeff Clites Wed, 26 May 2004 18:55:40 -0700

On May 26, 2004, at 2:02 AM, Nicholas Clark wrote:

> On Tue, May 25, 2004 at 07:48:32PM -0700, Jeff Clites wrote:
>> On May 25, 2004, at 12:26 PM, Dan Sugalski wrote:
>>> Yup. UTF8 is just another variable-width encoding. Do anything with it and we convert it to a fixed-width encoding, in this case UTF32.
>> This has the unfortunate side-effect of wasting 50-75% of the storage
>> space in the common cases, of course.
> True. But variable length encodings suck performance-wise.

Yes--that was the point I made previously in this thread. But my proposed scheme was neither variable length nor egregiously wasteful of space.

> The only thing that might be useful to cache on a UTF8 string is the highest code point seen, so that we know whether to unpack to 8, 16 or 32 bit without a scan. Presumably we can find this when we input validate on the "conversion" from binary to UTF8.


This is basically what I implemented.
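
For readers of the archive, here is a minimal sketch of the idea in Python. It is an illustration only, not the actual Parrot code: the names highest_code_point, width_for, and unpack are hypothetical, and the little-endian packing is an arbitrary choice made for the example. The highest code point is recorded once, during validation, and later selects the fixed width without a second scan.

```python
def highest_code_point(utf8_bytes: bytes) -> int:
    """Validate UTF-8 input and return the highest code point seen (0 if empty)."""
    # Decoding doubles as validation: invalid UTF-8 raises UnicodeDecodeError.
    text = utf8_bytes.decode("utf-8")
    return max((ord(ch) for ch in text), default=0)


def width_for(highest: int) -> int:
    """Pick the narrowest fixed width (in bits) that can hold every code point."""
    if highest <= 0xFF:
        return 8
    if highest <= 0xFFFF:
        return 16
    return 32


def unpack(utf8_bytes: bytes, cached_highest: int) -> bytes:
    """Convert UTF-8 to a fixed-width buffer, using the cached maximum
    to choose the width instead of scanning the string again."""
    size = width_for(cached_highest) // 8  # bytes per character
    text = utf8_bytes.decode("utf-8")
    # Little-endian is an assumption of this sketch, not something the thread specifies.
    return b"".join(ord(ch).to_bytes(size, "little") for ch in text)


raw = "a\u20ac".encode("utf-8")    # "a" followed by the euro sign, U+20AC
highest = highest_code_point(raw)  # cached at validation time
packed = unpack(raw, highest)      # two characters at 16 bits each
```

With this scheme, pure ASCII or Latin-1 text unpacks to one byte per character rather than the four that UTF32 would use, while every character within a string still has the same width.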

JEff
