On Monday, September 28, 2015 at 3:32:55 AM UTC-4, Jameson wrote:
>
> I find it interesting to note that the wikipedia article points out that
> if size compression is the goal (and there is enough text for it to
> matter), then SCSU (or other attempts at creating a unicode-specific
> compression scheme) is inferior to using a general purpose compression
> algorithm. Since the entropy* of the data is independent of its encoding,
> the size of the compressed data should also be fairly independent of the
> encoding.
>
> *entropy is a measure of the amount of "information" contained in a block
> of data.
>
> Under optimal compression, the size of the data should equal the entropy.
> Using https://en.wikipedia.org/wiki/Entropy_%28information_theory%29 as my
> reference, typical English text encoded in ASCII can be stored in ~1 bit /
> character (e.g. UTF-8 has 700% overhead over the optimal encoding scheme).
> At this level of excess over actual compression, there should not be much
> point to the argument over whether 700% or 1500% bloat is "better".
Theory is fine, until you have to do something in the real world. Even with a
very large file (1 GB) of highly compressible XML that is mostly English text,
typical general-purpose compression schemes generally don't get better than
1/2 to 1/4 of the original size, and the encoding/decoding can take large
amounts of memory for their dictionaries. (Note: if you really think you can
do better, there is 50,000 euros of prize money waiting for you; see
http://mattmahoney.net/dc/text.html.)

A general-purpose compression algorithm really doesn't do well when the
average length of what you are compressing is less than 16 characters (see
the zlib sketch at the end of this message). If you are interested, please
read http://www.unicode.org/notes/tn14/UnicodeCompression.pdf, which has a
good discussion of Unicode-specific compression schemes (such as BOCU-1 and
SCSU) versus general-purpose compression schemes. In particular, read the
last paragraph or two of page 12, and the conclusions at the end.

The Unicode-specific scheme I came up with back in the '90s (which is probably
more heavily used than either BOCU-1 or SCSU, but is proprietary) uses
run-length encoding and packs sequences of digits (think of what you'd see in
a CSV file, for example), and so achieves higher compression ratios than
either BOCU-1 or SCSU (the toy digit-packing sketch at the end of this message
gestures at the idea).

Note that I have nothing against general-purpose compression schemes. They
work pretty well when compressing whole files, or whole blocks or chunks of
data (before encryption, of course!), say when moving blocks from cache to
disk and vice versa; they just don't help at all for compressing short Unicode
sequences quickly.
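
To make the short-string point concrete, here's a quick sketch of my own,
using Python's zlib purely as a stand-in for a typical general-purpose
compressor: the fixed header/checksum overhead alone is bigger than most
short strings, so "compressing" them actually makes them larger.

    import zlib

    samples = [
        "hello",                                                # 5 chars
        "2015-09-28,3.14159,42",                                # short CSV-ish record
        "The quick brown fox jumps over the lazy dog. " * 20,   # ~900 chars
    ]

    for s in samples:
        raw = s.encode("utf-8")
        packed = zlib.compress(raw, 9)
        print(f"{len(raw):5d} bytes -> {len(packed):5d} bytes "
              f"({len(packed) / len(raw):.2f}x)")

The first two samples come out *larger* than the input, while the long
repetitive string compresses very well -- which is exactly the whole-file vs.
short-string distinction above.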
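
And since the actual scheme is proprietary, take the following only as a toy
sketch of the digit-packing idea (my own rough approximation, not the real
format): pack runs drawn from a small "numeric" alphabet (digits plus common
CSV punctuation) into 4 bits per character, and pass everything else through
as literal bytes.

    # Toy illustration only -- not the proprietary scheme described above.
    PACKABLE = "0123456789.,-+ "   # 15 symbols; nibble value 15 is padding

    def toy_pack(text):
        """Tag byte 0x80|n: run of n packable chars, 4 bits each, follows.
        Tag byte n (n < 0x80): n literal UTF-8 bytes follow."""
        out = bytearray()
        i = 0
        while i < len(text):
            if text[i] in PACKABLE:
                j = i
                while j < len(text) and text[j] in PACKABLE and j - i < 127:
                    j += 1
                run = text[i:j]
                out.append(0x80 | len(run))
                for k in range(0, len(run), 2):
                    hi = PACKABLE.index(run[k])
                    lo = PACKABLE.index(run[k + 1]) if k + 1 < len(run) else 15
                    out.append((hi << 4) | lo)
                i = j
            else:
                chunk = bytearray()
                while i < len(text) and text[i] not in PACKABLE:
                    b = text[i].encode("utf-8")
                    if len(chunk) + len(b) > 127:
                        break
                    chunk += b
                    i += 1
                out.append(len(chunk))
                out += chunk
        return bytes(out)

    row = "1034,2269.5,883,14200069,2015-09-28"
    print(len(row.encode("utf-8")), "->", len(toy_pack(row)))   # 35 -> 19

A real scheme would combine something like this with run-length encoding and
sensible handling of the rest of the Unicode range, but even the toy version
shows why CSV-like numeric text packs down to roughly half its size without
needing any dictionary at all.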