On May 26, 2004, at 2:02 AM, Nicholas Clark wrote:

On Tue, May 25, 2004 at 07:48:32PM -0700, Jeff Clites wrote:
On May 25, 2004, at 12:26 PM, Dan Sugalski wrote:

Yup. UTF8 is just another variable-width encoding. Do anything with it
and we convert it to a fixed-width encoding, in this case UTF32.

This has the unfortunate side-effect of wasting 50-75% of the storage space in the common cases, of course.
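
For reference, the 50-75% figure can be checked with a little arithmetic. This is only a rough sketch in Python, not anything from Parrot; `wasted_fraction` is a made-up helper name, and it assumes UTF-32 costs exactly 4 bytes per code point with no BOM.

```python
def wasted_fraction(text: str) -> float:
    """Fraction of UTF-32 storage that is overhead relative to UTF-8.

    Hypothetical helper for illustration; assumes 4 bytes per code
    point in UTF-32 and no byte-order mark.
    """
    if not text:
        return 0.0
    utf8_bytes = len(text.encode("utf-8"))
    utf32_bytes = 4 * len(text)  # len() counts code points in Python 3
    return 1 - utf8_bytes / utf32_bytes


# Pure ASCII: 1 byte per character in UTF-8 versus 4 in UTF-32.
ascii_waste = wasted_fraction("hello")

# Characters that need 2 bytes in UTF-8 (e.g. U+00E9) versus 4 in UTF-32.
latin_waste = wasted_fraction("\u00e9\u00e9")
```

ASCII text comes out at the 75% end and two-byte UTF-8 text at the 50% end, which matches the range quoted above.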

True. But variable length encodings suck performance-wise.

Yes--that was the point I made previously in this thread. But my proposed scheme was neither variable length nor egregiously wasteful of space.


The only thing that might be useful to cache on a UTF8 string is the highest
code point seen, so that we know whether to unpack to 8, 16 or 32 bit without
a scan. Presumably we can find this when we input validate on the
"conversion" from binary to UTF8.

This is basically what I implemented.
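
To make the idea concrete, here is a minimal sketch in Python rather than the actual implementation. The names `max_code_point`, `unit_width` and `unpack` are invented for illustration, and the little-endian byte order is an arbitrary assumption. It validates the UTF-8 input, records the highest code point seen, and uses that to pick an 8, 16 or 32 bit fixed-width unit.

```python
def max_code_point(data: bytes) -> int:
    """Validate UTF-8 input and return the highest code point seen.

    Decoding raises UnicodeDecodeError on invalid input, so validation
    and the scan happen together. Returns 0 for empty input.
    """
    text = data.decode("utf-8")
    return max(map(ord, text), default=0)


def unit_width(highest: int) -> int:
    """Smallest fixed width (in bits) that can hold every code point."""
    if highest <= 0xFF:
        return 8
    if highest <= 0xFFFF:
        return 16
    return 32


def unpack(data: bytes) -> tuple[int, bytes]:
    """Convert UTF-8 bytes to a fixed-width buffer of the chosen width.

    Byte order is little-endian here purely as an assumption.
    """
    text = data.decode("utf-8")
    width = unit_width(max(map(ord, text), default=0))
    size = width // 8
    buf = b"".join(ord(ch).to_bytes(size, "little") for ch in text)
    return width, buf


# In the fixed-width buffer, character i lives at byte offset i * (width // 8),
# so indexing no longer needs a scan from the start of the string.
width, buf = unpack("h\u20ac".encode("utf-8"))
```

In a real string type the highest code point would be cached on the string at validation time, so later unpacking would not need the second pass that `unpack` does here.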

Jeff


