On 11/24/03 01:26, Doug Ewell wrote:
OK, this *is* a fascinating question. ...So the question becomes: Is it legitimate for a Unicode compression engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into another (canonically equivalent) normalization form to improve its compressibility?
...
It seems to me that there is some kind of mixing of levels here. At one level, we have a text which consists of a string of Unicode characters, and it is this string which can be normalised or denormalised (or in fact subjected to any transformation which preserves canonical equivalence) at will. At a lower level, we have a sequence of code units in a Unicode encoding form. And at a still lower level we have a sequence of bytes which, at this level, have no known interpretation. It is surely at this lowest level that lossless compression should operate. Now such a compression scheme may receive information from a higher level that the byte stream is in a particular encoding form of Unicode, and may make use of that information as a hint. But it should treat this as nothing more than a hint, not necessarily reliable, and should preserve the byte stream exactly through compression and decompression.
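To make the distinction between levels concrete, here is a minimal Python sketch (using the standard unicodedata module; the particular Hangul syllable is just an example of my choosing). The composed and decomposed strings are canonically equivalent at the character level, but at the byte level they are quite different sequences:

    import unicodedata

    # Hangul syllable GAG: one precomposed code point (NFC) versus its
    # canonical decomposition into three jamo code points (NFD).
    composed = "\uAC01"
    decomposed = unicodedata.normalize("NFD", composed)

    # The two strings are canonically equivalent at the character level...
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True

    # ...but their UTF-8 byte streams are not the same sequence of bytes.
    print(composed.encode("utf-8"))     # b'\xea\xb0\x81'  (3 bytes)
    print(decomposed.encode("utf-8"))   # b'\xe1\x84\x80\xe1\x85\xa1\xe1\x86\xa8'  (9 bytes)

So a process which "normalises" at the byte level has, from the lower level's point of view, simply replaced one byte stream with a different one.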
If conformance clause C10 is taken to be operative at all levels, this makes a nonsense of the concept of normalisation stability within databases etc. If a low-level process is permitted to make any canonically equivalent transformation, then there can be no guarantee that data which is stored in a particular normalisation form is retrievable in that same form, because a low-level compression or other process may have transformed the data on the disk or tape, or on its way to or from it.
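To illustrate the problem, a purely hypothetical sketch (not a real SCSU or BOCU-1 implementation; the function names and the use of zlib are my own invention for the example): suppose a compressor quietly renormalised to NFC before compressing. Data deliberately stored in NFD would then not survive a compress/decompress round trip byte for byte:

    import unicodedata, zlib

    def normalizing_compress(data: bytes) -> bytes:
        # Hypothetical compressor that renormalises to NFC before compressing.
        text = data.decode("utf-8")
        return zlib.compress(unicodedata.normalize("NFC", text).encode("utf-8"))

    def decompress(blob: bytes) -> bytes:
        return zlib.decompress(blob)

    # Data deliberately stored in NFD, e.g. by a database guaranteeing that form.
    stored = unicodedata.normalize("NFD", "\uAC01").encode("utf-8")

    round_tripped = decompress(normalizing_compress(stored))
    print(round_tripped == stored)   # False: the stored normalisation form is gone

The decompressed text is still canonically equivalent to what was stored, but the guarantee that it comes back in the same normalisation form has been lost.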
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/