I wrote in response to Philippe Verdy:

> Changing a sequence S to a canonically equivalent sequence S' is
> explicitly allowed by conformance clause C10, but I would be surprised
> if any designers or implementers of Unicode compression schemes have
> considered such a thing.  Does this violate users' expectations of
> what "lossless" compression means?  This might be a topic all by
> itself.

The more I think about this, the more it intrigues me.  I can't decide
what is the "right thing" to do here, or if there is a "right thing" at
all.

It certainly seems true that under some circumstances, some samples of
Unicode text could be made more compressible by converting them to a
different normalization form -- either NFC, NFD, or a "partial" form
like the one Philippe described.

For example, consider the Korean name "Kwan" encoded as jamos (NFD):

U+110F U+116A U+11AB

Assuming that this (very) short string appeared at the beginning of a
text stream, it would be encoded as follows:

UTF-8:     E1 84 8F E1 85 AA E1 86 AB
SCSU:      1B 22 8F EA 1C 23 AB
BOCU-1:    E1 79 BA D0 38

(Note in the SCSU example that two dynamic windows have to be defined,
for the half-blocks at U+1100 and U+1180.  This is a one-time cost;
subsequent references to these blocks within the same stream would use
the same windows.)
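
For anyone who wants to double-check, the UTF-8 row can be reproduced
with a couple of lines of Python; SCSU and BOCU-1 have no
standard-library codecs, so those rows are not checked here:

    kwan_nfd = "\u110F\u116A\u11AB"            # "Kwan" as jamos (NFD)
    utf8 = kwan_nfd.encode("utf-8")
    print(" ".join(f"{b:02X}" for b in utf8))  # E1 84 8F E1 85 AA E1 86 AB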

Now, this same name "Kwan" could be equivalently encoded (NFC) as the
syllable U+CF74.  Even though none of the three formats shown here is
optimized for efficient representation of Hangul syllables, they would
all see a big improvement compared to the jamos (again assuming this is
at the beginning of a stream):

UTF-8:     EC BD B4
SCSU:      0F CF 74
BOCU-1:    FB BC 10
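
The same kind of check confirms both the composition and the UTF-8 row
(Hangul composition is part of canonical composition, so NFC produces
the syllable directly):

    import unicodedata

    kwan_nfd = "\u110F\u116A\u11AB"
    kwan_nfc = unicodedata.normalize("NFC", kwan_nfd)
    print(f"U+{ord(kwan_nfc):04X}")                                # U+CF74
    print(" ".join(f"{b:02X}" for b in kwan_nfc.encode("utf-8")))  # EC BD B4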

For longer, more realistic texts, SCSU would stay in Unicode mode and
use 2 bytes per syllable, while BOCU-1 would take advantage of
relatively short 2-byte (and occasional 1-byte) jumps within the Hangul
Syllables super-block.  Spaces and CR/LF pairs between syllables give an
edge to BOCU-1, while other interspersed Basic Latin characters, such as
full stops and other punctuation, give a (less frequent) edge to SCSU.
(Remember how, a day ago, I wondered why BOCU-1 did so much better on
Korean text encoded as syllables?  Now I know.)

Regardless of the CES or TES, however, it is clear that modern Korean
can be expressed more compactly using precomposed syllables (NFC) than
using jamos (NFD).
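
For UTF-8 at least, the difference is easy to quantify.  Here is a
small sketch using an arbitrary sample sentence; it only counts code
points and bytes, and says nothing about SCSU or BOCU-1 output sizes:

    import unicodedata

    # "Korean text compression", entered as precomposed syllables (NFC).
    sample_nfc = "\uD55C\uAD6D\uC5B4 \uD14D\uC2A4\uD2B8 \uC555\uCD95"
    sample_nfd = unicodedata.normalize("NFD", sample_nfc)

    print(len(sample_nfc), len(sample_nfc.encode("utf-8")))  # 10 code points, 26 bytes
    print(len(sample_nfd), len(sample_nfd.encode("utf-8")))  # 23 code points, 65 bytes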

So the question becomes:  Is it legitimate for a Unicode compression
engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
another (canonically equivalent) normalization form to improve its
compressibility?

Conformance clause C10 explicitly allows this:

"When a process purports not to modify the interpretation of a valid
coded character representation, it shall make no change to that coded
character representation *other than the possible replacement of
character sequences by their canonical-equivalent sequences* or the
deletion of noncharacter code points."  (TUS 4.0, p. 60; emphasis mine)

However, in the non-text compression world there is a definite
distinction between "lossy" and "lossless" compression methods.  In the
former, it is understood that a certain amount of information from the
original source may be discarded, in the interest of improving
compression.  JPEG is the best-known example of a lossy image format;
image information may be modified or discarded according to compression
needs and the limitations of human vision.

In the latter, there is a built-in promise that the compression process
will not discard any information.  GIF is an example of an image format
that uses lossless compression; a 256-color bitmap, converted to GIF and
then back to its original format, should be bit-for-bit equivalent to
the original data (except for metadata used in the original format but
not in GIF, such as the "important colors" field in Windows BMP).
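
In code terms, the lossless promise is simply that a round trip
reproduces the input byte for byte.  A minimal illustration, with zlib
standing in for any lossless compressor:

    import zlib

    data = bytes(range(256)) * 4                           # stand-in for any binary data
    assert zlib.decompress(zlib.compress(data)) == data    # bit-for-bit identical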

Converting Unicode data to another normalization form seems to fall
somewhere between these two cases.  Conformance clause C10 tells us that
a process
can replace NFD data with its NFC equivalent and still claim "not to
modify" its interpretation; there is no "loss of data" in the sense of a
bitmap being converted to JPEG.  Yet there is no bit-for-bit equivalence
either; for a given text T, there is no promise that:

decompress(compress(T)) == T
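
To make that concrete, here is a hypothetical "normalizing compressor"
sketched in Python, with zlib standing in for SCSU or BOCU-1 and NFC
normalization applied on the way in.  The round trip preserves
canonical equivalence but not bit-for-bit identity:

    import unicodedata, zlib

    def compress(text):
        # Hypothetical: normalize to NFC before compressing, as C10 permits.
        return zlib.compress(unicodedata.normalize("NFC", text).encode("utf-8"))

    def decompress(data):
        return zlib.decompress(data).decode("utf-8")

    kwan_nfd = "\u110F\u116A\u11AB"
    roundtrip = decompress(compress(kwan_nfd))
    print(roundtrip == kwan_nfd)                               # False: not bit-for-bit
    print(unicodedata.normalize("NFD", roundtrip) == kwan_nfd) # True: canonically equivalent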

If a compressor were to take advantage of the potential to improve
compression through normalization, would end users feel that the data
had been changed in an unacceptable way?  Suppose a compression format
(SCSU or BOCU-1) is being used with a protocol that demands a particular
normalization form.  Would there be a conformance problem?  What about
any other "process" as described by C10 that feeds into another
protocol?  Can a higher-level protocol supersede C10 and *require* a
given normalization form?  The note under C10 says:

"All processes and higher-level protocols are required to abide by C10
as a minimum."

which implies that they may not supersede it.

I'm especially interested in what the compression and conformance
experts have to say about this, but of course the thoughtful input of
any list member is very much welcome.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

