Elliotte Rusty Harold <elharo at metalab dot unc dot edu> wrote:

>>          BZZZT!  Sorry, thanks for playing.  You can't get the
>> advantages of both with no drawbacks.  Given the octets 0x5B5B, how
>> would you know if you had "[[" or a Chinese character?
>
> Actually, it looks like SCSU may do exactly that. If I'm
> understanding the algorithms, it actually encodes most BMP characters
> in a single byte, compressing quite a bit better than my naive idea
> to switch between UTF-8 and UTF-16.

I too missed the point in Elliotte's original post that it was OK for
this transformation to be stateful.  Since that is the case, SCSU
definitely will fit the bill.

> All schemes I've seen do involve some sort of flag characters in the
> data stream to switch between different code ranges. As long as you
> can keep the number of flag characters added down below the savings,
> you're good to go. My original idea was to simply use a null to
> switch between ASCII and UTF-16. SCSU looks a lot more sophisticated.

SCSU *can be* a lot more sophisticated, but as Markus noted, a subset of
full-blown SCSU will often achieve really good compression.
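
For what it's worth, the null-flag idea Elliotte describes is only a
few lines of code.  Here is a rough sketch in Java (names are mine; it
assumes NUL never occurs in the text, so 0x00 is free to serve as the
mode toggle):

    import java.io.ByteArrayOutputStream;

    public class NullToggle {
        // 0x00 flips between 1-byte ASCII mode and 2-byte UTF-16BE mode.
        public static byte[] encode(String s) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            boolean wide = false;                 // start in ASCII mode
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);             // UTF-16 code unit
                boolean needWide = c > 0x7F;
                if (needWide != wide) {
                    out.write(0x00);              // flag character: switch modes
                    wide = needWide;
                }
                if (wide) {
                    out.write(c >> 8);            // high byte
                    out.write(c & 0xFF);          // low byte
                } else {
                    out.write(c);
                }
            }
            return out.toByteArray();
        }
    }

As Elliotte says, this wins exactly as long as the text doesn't hop
between ranges so often that the toggle bytes eat up the savings.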

> Of course, neither of those schemes will compress truly random data,
> but most data isn't random.

No scheme will compress truly random data, at least not consistently:
by a simple counting argument, any lossless coder that shrinks some
inputs must expand others, since there are fewer short outputs than
long inputs.

>>          Hmmm - again, this may be asking for too much.  The
>> UTF-8/UTF-16 transform is pretty simple.  Is it bogging you down?
>
> It is a noticeable point in my profiling. I really did have to make a
> choice between speed and space here. According to
> http://www.unicode.org/notes/tn6/#Performance it looks like SCSU is
> faster for a lot of languages but 10-25% slower for English, French
> and Japanese than the UTF-8/UTF-16 conversion.

If you are using the "mini" version of SCSU where Latin-1 characters are
stored as 1 byte each and everything else is stored as UTF-16 (using SCU
and UC0 tags to switch between modes), you ought to achieve really good
speed.
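
Concretely, the encoder side of that mini-SCSU is about a dozen lines.
A sketch in Java (simplified: real SCSU would also need SQn quoting
for the C0 controls other than NUL/HT/LF/CR, and UQU quoting for code
units whose high byte collides with a Unicode-mode tag; both are
omitted here):

    import java.io.ByteArrayOutputStream;

    public class MiniScsu {
        private static final int SCU = 0x0F; // single-byte mode -> Unicode mode
        private static final int UC0 = 0xE0; // Unicode mode -> single-byte mode, window 0

        // Latin-1 (U+0000..U+00FF) costs 1 byte/char via default window 0;
        // everything else is emitted as UTF-16BE code units in Unicode mode.
        public static byte[] encode(String s) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            boolean unicodeMode = false;
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c <= 0xFF) {
                    if (unicodeMode) { out.write(UC0); unicodeMode = false; }
                    out.write(c);
                } else {
                    if (!unicodeMode) { out.write(SCU); unicodeMode = true; }
                    out.write(c >> 8);
                    out.write(c & 0xFF);
                }
            }
            return out.toByteArray();
        }
    }

The inner loop is nothing more than a range test and a byte write,
which is why this variant ought to be fast.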

>> If space usage is random/indeterminate/evenly distributed, then,
>> assuming that any given string is primarily in a single language, a
>> TLV type discriminating between UTF-8 and UTF-16 should do nicely.
>> Precede each string with an OR of the MSB (0 for UTF-8, 1 for UTF-16)
>> and the length, in octets, of the string (therefore max of 32,767
>> octets per string, which shouldn't ordinarily be a problem).
>
> That would be a problem. I definitely cannot rule out long strings,
> where long is quite a bit larger than 32K.

Despite the often-stated claim that SCSU and BOCU-1 are "optimized for
short strings," they work just as well on arbitrarily long strings.
It's just that the compression ratio of general-purpose schemes gets
*much* better as the input text gets larger, so the relative benefit
of SCSU and BOCU-1 (compared to GP compression) is greatly reduced.
But for an internal-storage need like Elliotte's, and especially where
speed and simplicity are important, the Unicode-specific compression
formats look like winners.
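
Incidentally, the length-plus-flag header quoted above is just a
16-bit big-endian value whose top bit selects the encoding.  A sketch
in Java (names are mine), which also shows exactly where Elliotte's
longer-than-32K strings would fall over:

    // Build the 2-byte header: top bit = 1 for UTF-16, 0 for UTF-8;
    // low 15 bits = length of the string in octets (max 32,767).
    static byte[] header(boolean utf16, int octetLength) {
        if (octetLength > 0x7FFF) {
            throw new IllegalArgumentException("string longer than 32,767 octets");
        }
        int h = (utf16 ? 0x8000 : 0) | octetLength;
        return new byte[] { (byte) (h >> 8), (byte) h };
    }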

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

