On Monday, September 28, 2015 at 3:32:55 AM UTC-4, Jameson wrote:
>
> I find it interesting to note that the wikipedia article points out that
> if size compression is the goal (and there is enough text for it to
> matter), then SCSU (or other attempts at creating a unicode-specific
> compression scheme) is inferior to using a general purpose compression
> algorithm. Since the entropy* of the data is independent of its encoding,
> the size of the compressed data should also be fairly independent of the
> encoding.
>
> *entropy is a measure of the amount of "information" contained in a block
> of data.
>
> Under optimal compression, the size of the data should equal the entropy.
> Using https://en.wikipedia.org/wiki/Entropy_%28information_theory%29 as my
> reference, typical English text encoded in ASCII can be stored in ~1 bit /
> character (e.g. UTF-8 has 700% overhead over the optimal encoding scheme).
> At this level of excess over actual compression, there should not be much
> point to the argument over whether 700% or 1500% bloat is "better".
Theory is fine, until you have to do something in the real world. Even with a
very large file (1 GB) of highly compressible XML that is mostly English text,
typical general-purpose compression schemes generally don't get better than
1/2 to 1/4 of the original size, and the encoding/decoding can take large
amounts of memory for their dictionaries. (Note: if you really think you can
do better, there is 50,000 euros of prize money waiting for you; see
http://mattmahoney.net/dc/text.html.)

A general-purpose compression algorithm really doesn't do well when the
average length of what you are compressing is less than 16 characters (see
the zlib sketch at the end of this message). If you are interested, please
read http://www.unicode.org/notes/tn14/UnicodeCompression.pdf, which has a
good discussion of Unicode-specific compression schemes (such as BOCU-1 and
SCSU) versus general-purpose compression schemes. In particular, read the
last paragraph or two of page 12, and the conclusions at the end.

The Unicode-specific scheme I came up with back in the '90s (which is probably
more heavily used than either BOCU-1 or SCSU, but is proprietary) uses
run-length encoding and packs sequences of digits (think of what you'd see in
a CSV file, for example), and so achieves higher compression ratios than
either BOCU-1 or SCSU (the toy digit-packing sketch at the end of this message
gestures at the idea).

Note that I have nothing against general-purpose compression schemes. They
work pretty well when compressing whole files, or whole blocks or chunks of
data (before encryption, of course!), say when moving blocks from cache to
disk and vice versa; they just don't help at all for compressing short Unicode
sequences quickly.
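
To make the short-string point concrete, here's a quick sketch of my own,
using Python's zlib purely as a stand-in for a typical general-purpose
compressor: the fixed header/checksum overhead alone is bigger than most
short strings, so "compressing" them actually makes them larger.

    import zlib

    samples = [
        "hello",                                                # 5 chars
        "2015-09-28,3.14159,42",                                # short CSV-ish record
        "The quick brown fox jumps over the lazy dog. " * 20,   # ~900 chars
    ]

    for s in samples:
        raw = s.encode("utf-8")
        packed = zlib.compress(raw, 9)
        print(f"{len(raw):5d} bytes -> {len(packed):5d} bytes "
              f"({len(packed) / len(raw):.2f}x)")

The first two samples come out *larger* than the input, while the long
repetitive string compresses very well -- which is exactly the whole-file vs.
short-string distinction above.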
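
And since the actual scheme is proprietary, take the following only as a toy
sketch of the digit-packing idea (my own rough approximation, not the real
format): pack runs drawn from a small "numeric" alphabet (digits plus common
CSV punctuation) into 4 bits per character, and pass everything else through
as literal bytes.

    # Toy illustration only -- not the proprietary scheme described above.
    PACKABLE = "0123456789.,-+ "   # 15 symbols; nibble value 15 is padding

    def toy_pack(text):
        """Tag byte 0x80|n: run of n packable chars, 4 bits each, follows.
        Tag byte n (n < 0x80): n literal UTF-8 bytes follow."""
        out = bytearray()
        i = 0
        while i < len(text):
            if text[i] in PACKABLE:
                j = i
                while j < len(text) and text[j] in PACKABLE and j - i < 127:
                    j += 1
                run = text[i:j]
                out.append(0x80 | len(run))
                for k in range(0, len(run), 2):
                    hi = PACKABLE.index(run[k])
                    lo = PACKABLE.index(run[k + 1]) if k + 1 < len(run) else 15
                    out.append((hi << 4) | lo)
                i = j
            else:
                chunk = bytearray()
                while i < len(text) and text[i] not in PACKABLE:
                    b = text[i].encode("utf-8")
                    if len(chunk) + len(b) > 127:
                        break
                    chunk += b
                    i += 1
                out.append(len(chunk))
                out += chunk
        return bytes(out)

    row = "1034,2269.5,883,14200069,2015-09-28"
    print(len(row.encode("utf-8")), "->", len(toy_pack(row)))   # 35 -> 19

A real scheme would combine something like this with run-length encoding and
sensible handling of the rest of the Unicode range, but even the toy version
shows why CSV-like numeric text packs down to roughly half its size without
needing any dictionary at all.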