Elliotte Rusty Harold <elharo at metalab dot unc dot edu> wrote:

>> BZZZT! Sorry, thanks for playing. You can't get the
>> advantages of both with no drawbacks. Given the octets 0x5B5B, how
>> would you know if you had "[[" or a Chinese character?
>
> Actually, it looks like SCSU may do exactly that. If I'm
> understanding the algorithms, it actually encodes most BMP characters
> in a single byte, compressing quite a bit better than my naive idea
> to switch between UTF-8 and UTF-16.
I too missed the point in Elliotte's original post that it was OK for
this transformation to be stateful. Since that is the case, SCSU
definitely will fit the bill.

> All schemes I've seen do involve some sort of flag characters in the
> data stream to switch between different code ranges. As long as you
> can keep the number of flag characters added down below the savings,
> you're good to go. My original idea was to simply use a null to
> switch between ASCII and UTF-16. SCSU looks a lot more sophisticated.

SCSU *can be* a lot more sophisticated, but as Markus noted, a subset
of full-blown SCSU will often achieve really good compression.

> Of course, neither of those schemes will compress truly random data,
> but most data isn't random.

No scheme will compress truly random data, at least not consistently.

>> Hmmm - again, this may be asking for too much. The
>> UTF-8/UTF-16 transform is pretty simple. Is it bogging you down?
>
> It is a noticeable point in my profiling. I really did have to make a
> choice between speed and space here. According to
> http://www.unicode.org/notes/tn6/#Performance it looks like SCSU is
> faster for a lot of languages but 10-25% slower for English, French,
> and Japanese than the UTF-8/UTF-16 conversion.

If you use the "mini" version of SCSU, where Latin-1 characters are
stored as 1 byte each and everything else is stored as UTF-16 (using
the SCU and UC0 tags to switch between modes), you ought to achieve
really good speed. (There is a rough sketch of such an encoder at the
end of this message.)

>> If space usage is random/indeterminate/evenly distributed, then,
>> assuming that any given string is primarily in a single language, a
>> TLV type discriminating between UTF-8 and UTF-16 should do nicely.
>> Precede each string with an OR of the MSB (0 for UTF-8, 1 for UTF-16)
>> and the length, in octets, of the string (therefore a max of 32,767
>> octets per string, which shouldn't ordinarily be a problem).
>
> That would be a problem. I definitely cannot rule out long strings,
> where long is quite a bit larger than 32K.

Despite the often-stated claims that SCSU and BOCU-1 are "optimized
for short strings," they work just as well on arbitrarily long
strings. It's just that the performance of general-purpose compression
schemes gets *much* better as the input text gets larger, so the
relative benefit of SCSU and BOCU-1 (compared to general-purpose
compression) is greatly reduced. But for an internal-storage need like
Elliotte's, and especially where speed and simplicity are important,
the Unicode compression formats look like winners.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
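P.S. Here is the sort of encoder loop I have in mind for the "mini"
approach. This is just an untested sketch in Java (assuming that is the
environment; the class and method names are made up), not a conforming
general-purpose SCSU encoder: it never defines new dynamic windows and
uses only the standard SQU, SCU, UC0, and UQU tags, so any full SCSU
decoder should still be able to read its output.

import java.io.ByteArrayOutputStream;

/**
 * "Mini" SCSU sketch: Latin-1 stays in single-byte mode, everything
 * else is written as UTF-16BE in Unicode mode, switching back and
 * forth with the SCU and UC0 tags only.
 */
public class MiniScsuEncoder {

    private static final int SQU = 0x0E; // quote one BMP char in single-byte mode
    private static final int SCU = 0x0F; // switch to Unicode mode
    private static final int UC0 = 0xE0; // switch to single-byte mode, window 0 (Latin-1)
    private static final int UQU = 0xF0; // quote one BMP char in Unicode mode

    public static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean unicodeMode = false;  // an SCSU stream starts in single-byte mode

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x100) {
                if (unicodeMode) {
                    out.write(UC0);              // back to single-byte mode
                    unicodeMode = false;
                }
                if (c >= 0x20 || c == 0x00 || c == 0x09 || c == 0x0A || c == 0x0D) {
                    out.write(c);                // ASCII and Latin-1: one byte each
                } else {
                    out.write(SQU);              // other C0 controls collide with tag
                    out.write(0x00);             //  bytes, so quote them as UTF-16BE
                    out.write(c);
                }
            } else {
                if (!unicodeMode) {
                    out.write(SCU);              // switch to Unicode mode
                    unicodeMode = true;
                }
                int hi = c >>> 8;
                if (hi >= 0xE0 && hi <= 0xF2) {
                    out.write(UQU);              // high byte collides with the Unicode-
                }                                //  mode tag range, so quote it
                out.write(hi);                   // plain UTF-16BE code unit (surrogate
                out.write(c & 0xFF);             //  pairs pass through unchanged)
            }
        }
        return out.toByteArray();
    }
}

The matching decoder only has to recognize those four tags, which is
where the simplicity (and most of the speed) comes from; the price is
that text alternating rapidly between Latin-1 and, say, Greek pays a
tag byte at every switch, where full SCSU would set up a Greek window
once and stay in single-byte mode.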