On May 26, 2004, at 2:02 AM, Nicholas Clark wrote:
On Tue, May 25, 2004 at 07:48:32PM -0700, Jeff Clites wrote:
On May 25, 2004, at 12:26 PM, Dan Sugalski wrote:
Yup. UTF8 is Just another variable-width encoding. Do anything with it
and we convert it to a fixed-width encoding, in this case UTF32.
This has the unfortunate side-effect of wasting 50-75% of the storage space in the common cases, of course.
True. But variable-length encodings suck performance-wise.
Yes--that was the point I made previously in this thread. But my proposed scheme was neither variable length nor egregiously wasteful of space.
The only thing that might be useful to cache on a UTF8 string is the highest
code point seen, so that we know whether to unpack to 8, 16 or 32 bit without
a scan. Presumably we can find this when we validate the input during the
"conversion" from binary to UTF8.
This is basically what I implemented.
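
For readers following along, here is a minimal sketch of the idea Nicholas describes: record the highest code point while validating the UTF-8 input, then use it to pick an 8-, 16- or 32-bit fixed-width form without a second scan. This is an illustration only, not the code from this thread; the function names (scan_utf8, unit_width, unpack_fixed) and the little-endian byte order are assumptions made for the example.

```python
def scan_utf8(data):
    """Validate UTF-8 input and return (text, highest code point seen).

    Hypothetical helper: bytes.decode raises UnicodeDecodeError on
    invalid input, so validation and decoding happen in one step here.
    """
    text = data.decode("utf-8")
    highest = max(map(ord, text), default=0)
    return text, highest


def unit_width(highest):
    """Smallest fixed width in bytes (1, 2 or 4) that fits every code point."""
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def unpack_fixed(data):
    """Unpack UTF-8 bytes into a fixed-width buffer; return (buffer, width)."""
    text, highest = scan_utf8(data)
    width = unit_width(highest)
    # Little-endian is an arbitrary choice for this sketch.
    buf = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return buf, width


if __name__ == "__main__":
    for sample in ("hello", "caf\u00e9", "\u20ac5", "\U0001F600"):
        buf, width = unpack_fixed(sample.encode("utf-8"))
        print(repr(sample), "width:", width, "bytes used:", len(buf))
```

With this scheme, text that stays within the 8-bit range costs one byte per character rather than the four that an unconditional UTF32 unpack would use, which is where the 50-75% saving mentioned earlier in the thread comes from.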
Jeff