[EMAIL PROTECTED] wrote:

> Further, a Unicode-aware algorithm would expect a choseong character to
> be followed by a jungseong and a jongseong to follow a jungseong, and
> could essentially perform the same benefits to compression that
> normalising to NFC performs but without making an irreversible change
> (i.e. it could tokenise the Jamo sequences rather than normalising and
> then tokenising).

Isn't that equivalent to what bzip2 does, but without knowledge of Unicode composition rules: simply by discovering that jamos are structured within their syllables and creating, on the fly, code positions to represent their compositions? A 2% difference can be explained by the fact that bzip2 must still discover the new "clusters" by encoding them first in their decomposed form, before it can use codes to represent the composed forms for the rest of the text.

> > Whether a "silent" normalization to NFC can be a legitimate part of
> > Unicode compression remains in question. I notice the list is still
> > split as to whether this process "changes" the text (because checksums
> > will differ) or not (because C10 says processes must consider the text
> > to be equivalent).

And what about a compressor that would identify the source as Unicode and convert it first to NFC, but that would also use composed forms for the compositions normally excluded from NFC? This seems marginal, but some languages, such as pointed Hebrew and Arabic, would compress better if these canonically equivalent compositions were taken into account.
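
For anyone who wants to try this, here is a minimal Python sketch (not taken from any of the tests discussed in this thread) that compresses the same Hangul text in NFC and in NFD with bzip2, and shows one composition that NFC excludes. The sample sentence, the repeat count, and the helper name `sizes` are assumptions made up for illustration; the actual ratios depend on the corpus, so no figures are claimed here.

```python
import bz2
import unicodedata


def sizes(text: str) -> dict:
    """Return (raw UTF-8 size, bzip2 size) for the NFC and NFD forms of text."""
    result = {}
    for form in ("NFC", "NFD"):
        encoded = unicodedata.normalize(form, text).encode("utf-8")
        result[form] = (len(encoded), len(bz2.compress(encoded)))
    return result


# Hypothetical sample: precomposed Hangul syllables, repeated to give
# the compressor something to work with.
sample = "한글 자모는 음절 단위로 모아 씁니다. " * 200

# NFD splits each syllable into choseong + jungseong (+ jongseong) jamos.
nfd = unicodedata.normalize("NFD", sample)
# For text that started out as precomposed syllables, NFC restores it exactly.
nfc = unicodedata.normalize("NFC", nfd)

# Code points of one decomposed syllable (U+D55C).
han_jamos = [hex(ord(c)) for c in unicodedata.normalize("NFD", "한")]

# U+FB2A (shin with shin dot) is a composition exclusion: NFC leaves it
# decomposed as U+05E9 U+05C1 rather than recomposing it.
shin = unicodedata.normalize("NFC", "\uFB2A")

if __name__ == "__main__":
    for form, (raw, packed) in sizes(sample).items():
        print(f"{form}: {raw} bytes raw, {packed} bytes after bzip2")
    print("jamos of U+D55C:", han_jamos)
    print("NFC of U+FB2A:", [hex(ord(c)) for c in shin])
```

A compressor of the kind suggested above would, in effect, map U+05E9 U+05C1 back to a single code such as U+FB2A before encoding, and undo that mapping after decoding.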