I find it interesting that the Wikipedia article points out that if size
compression is the goal (and there is enough text for it to matter), then
SCSU (or any other attempt at a Unicode-specific compression scheme) is
inferior to a general-purpose compression algorithm. Since the entropy* of
the data is independent of its encoding, the size of the compressed data
should also be fairly independent of the encoding.

*entropy is a measure of the amount of "information" contained in a block
of data.
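
A rough way to check this claim is to encode the same text several ways and
run each result through a general-purpose compressor. The sketch below is
only an illustration under my own assumptions: the sample text is made up,
zlib stands in for "a general purpose compression algorithm", and the helper
names are hypothetical. Real results depend on the text and the compressor.

```python
import zlib

# Hypothetical sample text (not from the thread). Every character is in the
# Basic Multilingual Plane: mostly ASCII, plus some accented Latin letters.
SAMPLE = (
    "The entropy of a message does not depend on how its characters are "
    "encoded. Français: l'été dernier, nous avons visité un château près "
    "de la forêt. Deutsch: Die Größe der Datei ändert sich, aber die "
    "Information bleibt gleich. Español: el niño pequeño comió una piña "
    "en la montaña. Íslenska: það er gaman að læra ný tungumál."
)

# The "-le" codec variants are used so that no byte order mark is added.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_and_compressed_sizes(text):
    """Map each encoding name to (raw size, zlib-compressed size) in bytes."""
    sizes = {}
    for name in ENCODINGS:
        raw = text.encode(name)
        sizes[name] = (len(raw), len(zlib.compress(raw, 9)))
    return sizes


def spread(values):
    """Ratio of the largest value to the smallest value."""
    values = list(values)
    return max(values) / min(values)


if __name__ == "__main__":
    sizes = encoded_and_compressed_sizes(SAMPLE)
    for name, (raw, packed) in sizes.items():
        print(f"{name:10s} raw={raw:5d} bytes  compressed={packed:5d} bytes")
    print("raw spread:       ", spread(r for r, _ in sizes.values()))
    print("compressed spread:", spread(c for _, c in sizes.values()))
```

On a text this short the compressed sizes will not be identical, since a
real compressor is not optimal, but the gap between encodings should be
noticeably narrower after compression than before it.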

Under optimal compression, the size of the data should equal its entropy.
Using https://en.wikipedia.org/wiki/Entropy_%28information_theory%29 as my
reference, typical English text encoded in ASCII can be stored in ~1
bit / character (i.e. UTF-8, at 8 bits / character for such text, has 700%
overhead over the optimal encoding scheme). With that much excess over what
compression can actually achieve, there is not much point in arguing over
whether 700% or 1500% bloat is "better".

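The overhead figures follow from simple arithmetic, sketched below. The
function name is hypothetical, the ~1 bit / character entropy is the
estimate cited above, and I am assuming the 1500% figure corresponds to a
16 bit / character encoding such as UTF-16.

```python
def overhead_percent(bits_per_char, entropy_bits_per_char=1.0):
    """Percent by which an encoding exceeds the entropy-optimal size."""
    excess = bits_per_char - entropy_bits_per_char
    return 100.0 * excess / entropy_bits_per_char


if __name__ == "__main__":
    # 8 bits: ASCII-range text in UTF-8; 16 bits: assumed UTF-16 case.
    for bits in (8, 16):
        print(f"{bits:2d} bits / character: {overhead_percent(bits):.0f}% overhead")
```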

On Sun, Sep 27, 2015 at 10:47 PM Scott Jones <scott.paul.jo...@gmail.com>
wrote:

> The ANSI Latin 1 character set, which is equivalent to the 1st 256
> characters of the Unicode character set,
> supports the following languages: Western Europe and Americas: Afrikaans,
> Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French,
> Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish
> and Swedish.
>
> If you store things like Python 3, those will all be stored in 1 byte per
> character.
> In UTF-8, those will use 2 bytes per character for characters between
> 128-255.
>
> Things like Greek, Arabic (most of it, at least), Hebrew, Cyrillic will
> take 2 bytes per character in UTF-8.
>
> Since UTF-8 takes so much space when dealing with text from a large part
> of the world's languages
> (and languages used by > 60% of the world population, by my estimates), in
> the past, I had to come up with packing schemes (that were designed for
> optimizing space, not ease of processing) for efficiently storing Unicode
> text in a database, which other people have also done (see BOCU-1 & SCSU).
>
> Scott
>
>
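
The per-character byte counts quoted above are easy to confirm. This is
just a small sketch: the sample characters are my own picks (one per script
mentioned), and the helper names are hypothetical.

```python
# Illustrative characters, one per script mentioned in the quoted message.
SAMPLES = {
    "ASCII letter": "a",
    "Latin-1 letter": "\u00e9",  # e with acute accent
    "Greek": "\u03bb",           # lambda
    "Cyrillic": "\u0436",        # zhe
    "Hebrew": "\u05e9",          # shin
    "Arabic": "\u0628",          # beh
}


def utf8_len(ch):
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def latin1_len(ch):
    """Number of bytes in Latin-1, or None if it cannot be represented."""
    try:
        return len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(f"{label:15s} UTF-8: {utf8_len(ch)}  Latin-1: {latin1_len(ch)}")
```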
