On 24/11/2003 07:52, Mark E. Shoulson wrote:

On 11/24/03 01:26, Doug Ewell wrote:

So the question becomes:  Is it legitimate for a Unicode compression
engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
another (canonically equivalent) normalization form to improve its
compressibility?

OK, this *is* a fascinating question. ...

...


It seems to me that there is some kind of mixing of levels here. At one level, we have a text which consists of a string of Unicode characters, and this is the string which can be normalised or denormalised (in fact subjected to any transformation preserving canonical equivalence) at will. At a lower level, we have a sequence of code units (bytes, 16-bit units, etc.) in a Unicode encoding form. And at a still lower level we have a sequence of bytes which, at this level, have no known interpretation. It is surely at this lowest level that lossless compression should operate.

Now such a compression scheme may receive information from a higher level that the byte stream is in a particular Unicode encoding form, and may make use of that information as a hint. But it should treat this as nothing more than a hint, not necessarily reliable, and preserve the byte stream exactly through compression and decompression.
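To make the point concrete, here is a small sketch in Python (standard library only; my own illustration, not taken from SCSU or BOCU-1): the precomposed Hangul syllable U+D55C and its decomposed jamo sequence are canonically equivalent at the character level, yet their UTF-8 byte sequences differ, so a byte-level lossless codec has to hand back exactly the bytes it was given.

    import unicodedata
    import zlib

    han = "\uD55C"                            # HANGUL SYLLABLE HAN, precomposed (NFC)
    nfd = unicodedata.normalize("NFD", han)   # decomposes to U+1112 U+1161 U+11AB

    print(unicodedata.normalize("NFC", nfd) == han)   # True: canonically equivalent
    print(han.encode("utf-8"))   # b'\xed\x95\x9c'                            (3 bytes)
    print(nfd.encode("utf-8"))   # b'\xe1\x84\x92\xe1\x85\xa1\xe1\x86\xab'    (9 bytes)

    # A lossless byte-level round trip must return the original bytes exactly,
    # whichever normalisation form the text happened to be in.
    original = nfd.encode("utf-8")
    assert zlib.decompress(zlib.compress(original)) == original

The zlib round trip at the end simply stands in for any general-purpose byte-level compressor, which neither knows nor cares that the bytes happen to be Unicode text.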

If conformance clause C10 is taken to apply at all of these levels, it makes nonsense of the concept of normalisation stability within databases and the like. If a low-level process is permitted to make any canonically equivalent transformation, then there can be no guarantee that data stored in a particular normalisation form is retrievable in that same form, because a low-level compression or other process may have transformed the data on the disk or tape, or on its way to or from it.
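To illustrate the worry, here is a hypothetical sketch (again just my own Python illustration; the function name naughty_roundtrip is invented) of a storage or transfer layer that silently re-normalises to NFC on the way through. Data stored in NFD comes back canonically equivalent but no longer byte-identical, so any binary comparison against what was originally written fails.

    import unicodedata

    def naughty_roundtrip(data: bytes) -> bytes:
        """Hypothetical codec that re-normalises the text as a side effect."""
        return unicodedata.normalize("NFC", data.decode("utf-8")).encode("utf-8")

    stored = unicodedata.normalize("NFD", "\uD55C\uAE00").encode("utf-8")  # stored in NFD
    retrieved = naughty_roundtrip(stored)

    print(retrieved == stored)                    # False: the bytes have changed
    print(unicodedata.normalize("NFC", retrieved.decode("utf-8")) ==
          unicodedata.normalize("NFC", stored.decode("utf-8")))  # True: still equivalent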

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




