Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> I say YES only for compressors that are supposed to work on Unicode
> text (this applies to BOCU-1 and SCSU which are not intended to
> compress anything else than Unicode text), but NO of course for
> general purpose compressors (like deflate in zip files.)

Of course.

> I will say NO for encoding forms that are normally built to be
> directly parsable code point by codepoint in any direction and from
> random locations in strings. So a UTF encoding scheme is not supposed
> to change the normalization form.

Of course not.  Or so I would imagine, anyway.  After all, if a process
(see Peter Kirk's question) that compresses Unicode text can silently
change the normalization form, then why not a process that stores and
retrieves Unicode text using, say, UTF-8?  But that sounds wrong to me,
although it's what C10 says.

>> * Peter Kirk and Mark Shoulson say NO, it can't, because all the
>>   compressor really knows about is the byte stream, so it must be
>>   preserved byte-for-byte.
>
> But SCSU and BOCU-1 do not operate in the byte stream level, as their
> use is invalid on random streams of bytes, but only defined in terms
> of streams of code units...

That's right.  I tend to agree with the NO camp not because SCSU and
BOCU-1 are going to be applied to arbitrary binary data, but because the
*format* in which text is stored isn't normally expected to change the
contents.

Converting Unicode text from UTF-16LE to UTF-16BE, or UTF-16 to UTF-8,
changes the bits.  Everyone can see that.  But the *code points*
represented by those bits are not changed.  If the UTF-16BE sequence
<00 61 03 01> were converted to the UTF-8 sequence <C3 A1>, that would
be a change not only in the bits, but in the code points as well
(U+0061 U+0301 becoming U+00E1).  This is where the question lies.

> That's why I won't say that SCSU and BOCU-1 are really compressors,
> but rather really encoding schemes (CES in the ISO10646 terminology).

They are transfer encoding syntaxes (TES).  And I believe this
terminology is from Unicode, not 10646, though I could be wrong.

I would say encoders for SCSU and BOCU-1 are compressors.  They're just
not general-purpose compressors.

> In fact the result of BOCU-1 and SCSU encoding schemes can create a
> file which has its own charset (i.e. CCS+CES in the ISO terminology),
> and thus can also have its own label for MIME usage or in XML charset
> declarations. This is not a limitation, as true compressors can still
> be used if needed from this encoding scheme, or transparently within
> transport layers (such as the "Content-Transfer-Encoding:" in MIME and
> HTTP applications).

Yes, you can take SCSU- or BOCU-1-encoded text and recompress it using a
general-purpose compression scheme.  Atkin and Stansifer's paper from
last year is all about that, and I spend a few pages on it in my paper
as well.  You can also re-Zip a Zip file, though, so I don't know what
that proves about the compression formats.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
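
P.S.  For anyone who wants to see the <00 61 03 01> example concretely,
here is a minimal Python sketch.  It assumes only the standard
`unicodedata` module; the helper name `code_points` is made up for this
illustration and is not part of any library.  It is meant to show the
difference between re-encoding and normalizing, not to say how any
particular SCSU or BOCU-1 encoder behaves.

```python
import unicodedata


def code_points(s):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in s]


# UTF-16BE bytes <00 61 03 01>: U+0061 followed by U+0301 (combining acute).
utf16be = bytes.fromhex("00610301")
text = utf16be.decode("utf-16-be")

# Plain transcoding to UTF-8: the bytes change, the code points do not.
utf8 = text.encode("utf-8")
round_trip = utf8.decode("utf-8")

# Normalizing to NFC first: the code points themselves change.
nfc = unicodedata.normalize("NFC", text)
nfc_utf8 = nfc.encode("utf-8")

print("original code points: ", code_points(text))
print("transcoded UTF-8 bytes:", utf8.hex(" "))
print("after round trip:      ", code_points(round_trip))
print("NFC code points:       ", code_points(nfc))
print("NFC UTF-8 bytes:       ", nfc_utf8.hex(" "))
```

The two strings are canonically equivalent (normalizing the NFC result
back to NFD gives the original sequence), but they are not the same
sequence of code points, which is the distinction at issue above.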