Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> OK, this is a transform, but it is still canonically equivalent to the
> source text. Transformation between canonically equivalent strings is
> safe (at least for Korean Hangul), and this is what any normalizer
> performs.

But compressors aren't normally expected to perform normalization too.

>> This adds complexity, but is sometimes worth the effort.
>
> Not more complex than implementing the standard normalization. In any
> case, most normalizers perform Hangul syllable normalization
> algorithmically, so it takes only a couple of modified lines to add a
> check for possible TL/TLV composition in the NFC normalizer, or to
> decompose TL and TLV but not LV in the NFD normalizer.

Compressors aren't normally expected to perform normalization, even if
there is a simple algorithmic way to do it.
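
For reference, the algorithmic step Philippe mentions really is small.
Here is a minimal sketch of the standard conjoining-jamo composition
arithmetic, using the constants from the Unicode Standard (Python used
purely for illustration):

    SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    LCOUNT, VCOUNT, TCOUNT = 19, 21, 28

    def compose_lv(l, v):
        """Compose a leading consonant L and a vowel V into an LV
        syllable."""
        return SBASE + ((l - LBASE) * VCOUNT + (v - VBASE)) * TCOUNT

    def compose_lvt(lv, t):
        """Attach a trailing consonant T to an LV syllable, giving
        LVT."""
        return lv + (t - TBASE)

    # compose_lvt(compose_lv(0x1100, 0x1161), 0x11A8) == 0xAC01,
    # i.e. U+AC01 HANGUL SYLLABLE GAG.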

>> The Burrows-Wheeler block-sorting approach, for example, achieves
>> very good results by adding a preprocessing step before
>> "conventional" Huffman or arithmetic compression.
>
> I don't know what it is, but I think it tries to create a unique code
> for common sequences of code points, in order to approximately flatten
> their distribution statistics. The assigned codes are then sorted by
> frequency, and Huffman/arithmetic encoding of these partially
> flattened codes creates the variable-length codes with a bit more
> accuracy than if the intermediate codes were all unsorted. The
> interest is that the result of Huffman/arithmetic encoding will
> contain frequent sequences of bits set to 0, so it can be recompressed
> by a second Huffman/arithmetic pass, or by an RLE pass.

You're right, you don't know what it is.  :-)

See ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz for
the actual description.
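
The transform itself is simple to state, even if the efficient
algorithm in that report is more involved. A toy Python sketch of the
forward transform (sort all rotations, keep the last column, which
clusters similar characters for a later MTF/RLE/entropy pass):

    def bwt(s):
        """Naive Burrows-Wheeler transform; O(n^2 log n), for
        illustration only."""
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations), rotations.index(s)

    # bwt("banana") == ("nnbaaa", 3): the repeated characters group
    # together, which is what makes the later passes effective.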

> It would be interesting to know whether good SCSU compressors have
> been developed, i.e. how they compare when they choose one coding
> option or another: switching to UTF-16, temporarily escaping to
> UTF-16, managing the most frequent codes in the 8-bit codepage,
> switching codepages, resetting to the default codepages...

I think you mean "windows" here.  SCSU does not deal with legacy code
pages the way, say, ISO 2022 does.
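
A crude sketch of what a window buys you (a simplification, not a
conformant SCSU encoder -- no window definition, quoting, or
Unicode-mode handling):

    SC2 = 0x12             # SCSU tag: change to dynamic window 2
    WIN2_OFFSET = 0x0400   # window 2's default offset (Cyrillic)

    def encode_in_window(text):
        out = bytearray([SC2])
        for cp in map(ord, text):
            if 0x20 <= cp < 0x80:        # printable ASCII passes through
                out.append(cp)
            elif WIN2_OFFSET <= cp < WIN2_OFFSET + 0x80:
                out.append(0x80 + cp - WIN2_OFFSET)  # window-relative
            else:
                raise ValueError("needs a window switch or Unicode mode")
        return bytes(out)

    # encode_in_window("\u041f\u0440\u0438\u0432\u0435\u0442") is
    # 7 bytes: the tag plus one byte per Cyrillic letter, versus 12
    # bytes in UTF-16.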

> I'm sure that many compressors have been attempted, but not fully
> experimented with to get the best results. So it's too soon to compare
> the performance of BOCU-1 and SCSU.

I know one person who's looking forward to reading my paper.

> After all, the "deflate" compression method in zlib is public, and all
> ZIP tools can inflate deflated files, but they are not all equal in
> terms of effective compression ratio. The fine tuning of compressors
> between processing speed at compression time and result size requires
> a lot of effort, and this is where algorithms are proprietary and
> often patented, even though the decompressor itself is free and mostly
> identical across products, with very few design choices beyond minor
> optimizations for decompression speed.
>
> So we should wait some time until those who developed the best tunings
> for "deflate" try their techniques on SCSU. The current ICU
> implementation is not too bad, but after looking at some generated
> sequences, it is not always fully optimal, and I can still find
> shorter codes manually in examples that are not too hard to find,
> notably in documents using mixed languages or scripts. (I have not
> looked closely at the case of Korean, but I have looked at the
> currently excellent results for Russian and Greek; some optimizations
> are still possible for Latin-based languages that use a larger number
> of accents and use the Latin Extended letters relatively more
> frequently.)

Hoo boy, do I ever know someone who's looking forward to reading my
paper.
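
One point in there is easy to demonstrate, though: deflate encoders
differ while the decoder stays the same. A quick check with Python's
zlib module (illustration only):

    import zlib

    data = ("the quick brown fox jumps " * 200).encode("utf-8")
    for level in (1, 6, 9):     # speed/size trade-off at encode time
        out = zlib.compress(data, level)
        # One decoder handles the output of every compression level.
        assert zlib.decompress(out) == data
        print(level, len(out))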

> Of course, as BOCU-1 and SCSU are considered encoding schemes, they
> cannot themselves alter the represented Unicode strings into
> canonically equivalent ones. But I'm quite sure that if a process
> wishes to create strings that it knows will be normalized when read
> (or whose reader will ignore differences between canonically
> equivalent strings, something that all Unicode-based processes should
> respect), that process could take this into account within its BOCU-1
> or SCSU encoder by allowing it to change the normalization form as
> long as canonical equivalence is kept. For BOCU-1 this is quite
> simple, as it is a pipeline of separate tasks with few options, but in
> SCSU the normalization form of the input string interacts with the
> SCSU compressor during its run, and offers more alternatives to the
> compressor.

Changing a sequence S to a canonically equivalent sequence S' is
explicitly allowed by conformance clause C10, but I would be surprised
if any designers or implementers of Unicode compression schemes have
considered such a thing.  Does this violate users' expectations of what
"lossless" compression means?  This might be a topic all by itself.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

