> Further, a Unicode-aware algorithm would expect a choseong character to 
> be followed by a jungseong, and a jongseong to follow a jungseong, and 
> could essentially provide the same compression benefit that 
> normalising to NFC provides, but without making an irreversible change 
> (i.e. it could tokenise the Jamo sequences rather than normalising and 
> then tokenising).

Isn't that equivalent to what bzip2 already does, though without any 
knowledge of Unicode composition rules? It simply discovers that jamos 
are structured within their syllables and creates, on the fly, code 
positions to represent their compositions.
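
For comparison, the Unicode-aware variant described in the quoted text 
could be sketched roughly as below. This is only an illustration, not 
anyone's actual compressor: the function name tokenise_jamo is made up, 
and it assumes modern conjoining jamo only (choseong U+1100..U+1112, 
jungseong U+1161..U+1175, jongseong U+11A8..U+11C2). It groups each 
L V [T] run into one token without altering the text, so joining the 
tokens gives back the input exactly.

```python
import unicodedata

# Assumption: modern conjoining jamo ranges only (no archaic jamo).
L_RANGE = (0x1100, 0x1112)  # choseong (leading consonants)
V_RANGE = (0x1161, 0x1175)  # jungseong (vowels)
T_RANGE = (0x11A8, 0x11C2)  # jongseong (trailing consonants)


def _in_range(ch, bounds):
    return bounds[0] <= ord(ch) <= bounds[1]


def tokenise_jamo(text):
    """Split text into tokens, grouping each L V [T] jamo run as one token.

    Any other character becomes a single-character token. The text is
    never modified, so "".join(tokens) == text.
    """
    tokens = []
    i = 0
    n = len(text)
    while i < n:
        j = i + 1
        if _in_range(text[i], L_RANGE) and j < n and _in_range(text[j], V_RANGE):
            j += 1  # consume the vowel
            if j < n and _in_range(text[j], T_RANGE):
                j += 1  # consume the optional trailing consonant
        tokens.append(text[i:j])
        i = j
    return tokens


decomposed = unicodedata.normalize("NFD", "한글 A")
tokens = tokenise_jamo(decomposed)
print(len(decomposed), "code points ->", len(tokens), "tokens")
```

A compressor could then assign codes to these tokens much as it would 
to precomposed syllables, while the decoder still reproduces the 
original jamo sequence.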

A 2% difference can be explained by the fact that bzip2 must still 
discover the new "clusters" by encoding them first in their decomposed 
form before using codes to represent the composed forms for the rest of 
the text.
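
A rough way to measure this kind of difference is to compress the same 
text in both normalization forms with Python's bz2 module. This is just 
a sketch: the sample string and the helper name compressed_sizes are 
placeholders, and a short repeated sample will not reproduce the figure 
discussed here, so a real corpus should be substituted.

```python
import bz2
import unicodedata


def compressed_sizes(text):
    """Return raw and bzip2-compressed UTF-8 sizes for NFC and NFD forms."""
    result = {}
    for form in ("NFC", "NFD"):
        data = unicodedata.normalize(form, text).encode("utf-8")
        result[form] = (len(data), len(bz2.compress(data)))
    return result


# Placeholder sample; replace with a real Korean corpus for meaningful numbers.
sample = "한글은 세종대왕이 만든 글자이다. " * 200
sizes = compressed_sizes(sample)
for form, (raw, packed) in sizes.items():
    print(form, "raw bytes:", raw, "bzip2 bytes:", packed)
```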

> > Whether a "silent" normalization to NFC can be a legitimate part of
> > Unicode compression remains in question.  I notice the list is still
> > split as to whether this process "changes" the text (because checksums
> > will differ) or not (because C10 says processes must consider the text
> > to be equivalent).

And what about a compressor that identifies the source as Unicode and 
first converts it to NFC, but also uses the composed forms for the 
compositions that are normally excluded from NFC? This seems marginal, 
but some languages, such as pointed Hebrew and Arabic, would get better 
compression results if these canonically equivalent compositions were 
taken into account.
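
To make the idea concrete, here is a minimal sketch for pointed Hebrew. 
It is an assumption-laden illustration: the function name 
compose_excluded is invented, only the Hebrew presentation forms 
U+FB1D..U+FB4E are considered, and matching is a simple greedy scan 
that does not handle other marks interleaved in the sequence. The 
reverse table is built from the canonical decompositions that Python's 
unicodedata already knows, so the output stays canonically equivalent 
to the input.

```python
import unicodedata

# Build a reverse map: canonical decomposition -> composition-excluded
# Hebrew presentation form. NFD leaves compatibility-only characters
# unchanged, so only canonically decomposable forms are picked up.
EXCLUDED = {}
for cp in range(0xFB1D, 0xFB4F):
    ch = chr(cp)
    nfd = unicodedata.normalize("NFD", ch)
    if nfd != ch:
        EXCLUDED[nfd] = ch
MAX_LEN = max(len(k) for k in EXCLUDED)


def compose_excluded(text):
    """NFC-normalise, then greedily substitute excluded Hebrew compositions."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_LEN, 1, -1):  # try the longest sequence first
            piece = text[i:i + size]
            if piece in EXCLUDED:
                out.append(EXCLUDED[piece])
                i += size
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


shin_with_dot = "\u05E9\u05C1"  # SHIN + POINT SHIN DOT
packed = compose_excluded(shin_with_dot)
print(len(shin_with_dot.encode("utf-8")), "->", len(packed.encode("utf-8")), "bytes")
```

Since the substituted characters decompose canonically to the original 
sequences, a decompressor can restore ordinary NFC simply by 
normalising the output again.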
