At 10:26 AM -0800 1/20/04, Mike Ayers wrote:

> BZZZT! Sorry, thanks for playing. You can't get the advantages of both with no drawbacks. Given the octets 0x5B5B, how would you know if you had "[[" or a Chinese character?
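
He's right that the raw octets are ambiguous, and it's easy to demonstrate. A quick sketch in Java (the class name is mine, purely illustrative):

import java.io.UnsupportedEncodingException;

public class Ambiguity {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] octets = { 0x5B, 0x5B };
        // the same two octets decode to different text depending on the
        // encoding you assume
        System.out.println(new String(octets, "US-ASCII"));  // [[
        System.out.println(new String(octets, "UTF-16BE"));  // the single CJK ideograph U+5B5B
    }
}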

Actually, it looks like SCSU may do exactly that. If I'm understanding the algorithm, it encodes most BMP characters in a single byte, compressing quite a bit better than my naive idea of switching between UTF-8 and UTF-16.

All the schemes I've seen involve some sort of flag characters in the data stream to switch between different code ranges. As long as the bytes the flags add stay below the bytes they save, you come out ahead. My original idea was simply to use a null to switch between ASCII and UTF-16; SCSU looks a lot more sophisticated.

Of course, neither of those schemes will compress truly random data, but most data isn't random.
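
To make that concrete, here's a minimal sketch of the null-flag idea (the class and method names are mine, and it ignores error handling): ASCII characters go out as single bytes, and a NUL flag toggles the decoder between ASCII and UTF-16 modes.

import java.io.ByteArrayOutputStream;

public class NullFlagEncoder {

    // The text itself must never contain U+0000, which XML forbids anyway.
    public static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean utf16 = false;                  // start in ASCII mode
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean needs16 = c > 0x7F;
            if (needs16 != utf16) {             // emit the mode-switch flag
                out.write(0x00);
                if (utf16) out.write(0x00);     // NUL is two bytes in UTF-16 mode
                utf16 = needs16;
            }
            if (utf16) {
                out.write(c >>> 8);             // high byte first (big-endian)
                out.write(c & 0xFF);            // low byte
            }
            else {
                out.write(c);                   // single ASCII byte
            }
        }
        return out.toByteArray();
    }
}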


>> However, I would like the translation into and out of this format to
>> be at least as fast as the translation between UTF-8 and UTF-16 the
>> class is currently performing on every call to setValue and getValue,
>> ideally faster.

> Hmmm - again, this may be asking for too much. The UTF-8/UTF-16 transform is pretty simple. Is it bogging you down?


It shows up as a noticeable hot spot in my profiling, so I really did have to choose between speed and space here. According to http://www.unicode.org/notes/tn6/#Performance it looks like SCSU is faster for a lot of languages, but 10-25% slower than the UTF-8/UTF-16 conversion for English, French, and Japanese.
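
If I'm reading ICU4J's documentation right, it exposes SCSU through com.ibm.icu.text.UnicodeCompressor and UnicodeDecompressor, so measuring the size tradeoff on your own data is easy. A rough sketch, with made-up sample strings:

import com.ibm.icu.text.UnicodeCompressor;
import com.ibm.icu.text.UnicodeDecompressor;

public class ScsuSizeCheck {
    public static void main(String[] args) throws Exception {
        String latin = "Le caf\u00E9 na\u00EFf";              // mostly Latin
        String cjk = "\u6F22\u5B57\u306E\u30C6\u30B9\u30C8";  // CJK plus kana
        String[] samples = { latin, cjk };
        for (int i = 0; i < samples.length; i++) {
            String s = samples[i];
            byte[] utf8 = s.getBytes("UTF-8");
            byte[] scsu = UnicodeCompressor.compress(s);
            System.out.println(s.length() + " chars: UTF-8 " + utf8.length
                + " bytes, SCSU " + scsu.length + " bytes");
            // verify the round trip is lossless
            if (!s.equals(UnicodeDecompressor.decompress(scsu))) {
                throw new RuntimeException("round trip failed");
            }
        }
    }
}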



> If your application will deal mostly with European languages, or mostly with non-European ones, then just use UTF-8 or UTF-16 respectively; you won't really lose much space that way.


This is a class library that is relatively language neutral. If a Chinese programmer uses it, I'd expect them to have a lot of data in Chinese. So far most of the adoption I know about is in the Americas and Europe, but there's no reason it has to stay that way, especially if I can reduce the footprint for CJK text.



> If space usage is random/indeterminate/evenly distributed, then, assuming any given string is primarily in a single language, a TLV-style header discriminating between UTF-8 and UTF-16 should do nicely. Precede each string with the OR of an MSB flag (0 for UTF-8, 1 for UTF-16) and the length of the string in octets (hence a maximum of 32,767 octets per string, which shouldn't ordinarily be a problem).


That would be a problem. I definitely cannot rule out long strings, where long is quite a bit larger than 32K.
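
For clarity, here's how I read the proposed header (a hypothetical sketch); the 15 bits left for the length are exactly where the ceiling that worries me comes from:

import java.io.DataOutputStream;
import java.io.IOException;

public class TlvHeader {
    // One 16-bit header per string: the most significant bit flags the
    // encoding (0 = UTF-8, 1 = UTF-16), OR'd with the string's length
    // in octets, so the length can use only 15 bits.
    public static void writeHeader(DataOutputStream out, boolean utf16,
            int octetLength) throws IOException {
        if (octetLength > 0x7FFF) {
            throw new IllegalArgumentException(
                "string longer than 32,767 octets: " + octetLength);
        }
        out.writeShort((utf16 ? 0x8000 : 0x0000) | octetLength);
    }
}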

--

Elliotte Rusty Harold
[EMAIL PROTECTED]
Effective XML (Addison-Wesley, 2003)
http://www.cafeconleche.org/books/effectivexml
http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA



