I find it interesting that the Wikipedia article points out that if compressed size is the goal (and there is enough text for it to matter), then SCSU (or any other attempt at a Unicode-specific compression scheme) is inferior to a general-purpose compression algorithm. Since the entropy* of the data is independent of its encoding, the size of the compressed data should also be fairly independent of the encoding.
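A rough way to check this claim is to encode the same text several ways and run each result through a general-purpose compressor. The sketch below is only an illustration under my own assumptions: the sample text, the names `SAMPLE`, `ENCODINGS`, and `sizes_by_encoding`, and the choice of zlib (DEFLATE) are mine, not anything taken from the article. zlib is far from an optimal compressor, so the compressed sizes will not come out identical, but they should sit much closer together than the raw sizes do.

```python
import zlib

# Hypothetical sample text; substitute any reasonably long string.
SAMPLE = (
    "Since the entropy of the data is independent of its encoding, "
    "the size of the compressed data should also be fairly independent "
    "of the encoding. "
) * 20

# Encodings to compare. The "-le" variants avoid emitting a byte order mark.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def sizes_by_encoding(text, encodings=ENCODINGS):
    """Map each encoding name to (raw_bytes, zlib_compressed_bytes)."""
    result = {}
    for name in encodings:
        raw = text.encode(name)
        # Level 9 asks zlib for its strongest (slowest) compression.
        result[name] = (len(raw), len(zlib.compress(raw, 9)))
    return result


SIZES = sizes_by_encoding(SAMPLE)
for name, (raw, packed) in SIZES.items():
    print(f"{name:10s} raw={raw:6d} bytes  compressed={packed:5d} bytes")
```

To try it on text where UTF-8 needs 2 bytes per character (Greek, Cyrillic, and so on), pass a different string to `sizes_by_encoding`.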
*Entropy is a measure of the amount of "information" contained in a block of data. Under optimal compression, the size of the compressed data should equal the entropy. Using https://en.wikipedia.org/wiki/Entropy_%28information_theory%29 as my reference, typical English text encoded in ASCII can be stored in ~1 bit / character (so UTF-8, at 8 bits / character, has 700% overhead over the optimal encoding scheme). With that much excess over what compression actually achieves, there is not much point in arguing over whether 700% or 1500% bloat is "better".

On Sun, Sep 27, 2015 at 10:47 PM Scott Jones <scott.paul.jo...@gmail.com> wrote:

> The ANSI Latin 1 character set, which is equivalent to the 1st 256
> characters of the Unicode character set,
> supports the following languages: Western Europe and Americas: Afrikaans,
> Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French,
> Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish
> and Swedish.
>
> If you store things like Python 3, those will all be stored in 1 byte per
> character.
> In UTF-8, those will use 2 bytes per character for characters between
> 128-255.
>
> Things like Greek, Arabic (most of it, at least), Hebrew, Cyrillic will
> take 2 bytes per character in UTF-8.
>
> Since UTF-8 takes so much space when dealing with text from a large part
> of the world's languages
> (and languages used by > 60% of the world population, by my estimates), in
> the past, I had to come up with packing schemes (that were designed for
> optimizing space, not ease of processing) for efficiently storing Unicode
> text in a database, which other people have also done (see BOCU-1 & SCSU).
>
> Scott