Maybe a fourth level of abstraction is needed to complete what the MIME registry describes as "charsets": a TES (Transfer Encoding Syntax) sometimes happens at the end, and some legacy CES specifications mix in what should have been left to a separate TES.
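To make the layering concrete, here is a minimal Python sketch (my own illustration, not taken from any specification): the MIME "charset" step covers CCS+CEF+CES and turns code points into bytes, while the TES step is a separate byte-to-byte transform, such as Base64, applied afterwards.

```python
import base64

text = "Unicode\u00A0test"   # abstract characters; the CCS assigns their code points

# CCS+CEF+CES: the MIME "charset" turns code points into a byte stream.
# "utf-16" here is the CES variant that emits a leading BOM to fix byte order.
payload = text.encode("utf-16")

# TES: an independent byte-to-byte transform applied after the charset.
# Base64 makes the bytes safe for 7-bit transports; it knows nothing of Unicode.
wire = base64.b64encode(payload)

# Decoding reverses the stages in the opposite order: TES first, then the charset.
assert base64.b64decode(wire).decode("utf-16") == text
```

The point of the sketch is that the two stages are independent: the same Base64 TES works unchanged whatever charset produced the payload, and the charset does not need to know whether a TES was applied.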
For example, the specification of SCSU (Simple Compression Scheme for Unicode) defines it as a way to convert a stream of code points directly into a stream of bytes, without going through the intermediate abstraction level of "code units" (or rather, in this case, the code units are simply the encoded bytes). This makes SCSU both a legal CEF (as are UTF-32, UTF-16 and UTF-8) for converting a stream of encoded characters into a stream of 8-bit code units, and a legal CES (as are UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8 and CESU-8, or UTF-16, UTF-32, UTF-8 and CESU-8 with a leading BOM) that takes the generated byte order into account.

But the SCSU specification speaks about "optional extensions", which are probably badly named: they would be better described as TES (DLE-escaping for NUL and DLE, run-length compression, or COBS encoding), exactly like other well-known TES (Base64, Quoted-Printable) widely used in MIME contexts. I think there still exist other legacy charsets in the MIME registry that mix these levels of abstraction, where a clear separation between the CES and TES levels would have helped their interoperability.

One cause of this discrepancy is that it has long been easier to create a new charset and have it registered in the long MIME charset registry than to define a clean TES separately (the MIME TES registry is not very long, and support for multiple TES in applications has often been weak and not easily extensible; developers prefer to first build the support needed to handle correctly the many possible CES, identified simply by their MIME "charset" identifier).

The other, related "problem" of TES is that many document structures (including XML) only offer a place to specify the "charset" (i.e.
the result of combining a CCS, a CEF and a CES), but no place to specify the TES, which is apparently left to the transport protocol. This ignores the case of local storage, where reliable identification of the TES is nearly impossible. It means that local stores cannot easily benefit from the advantages of a TES specification: for example, when creating a reference to a text document, it is impossible to specify in the link that the document has been COBS-encoded, Base64-encoded, or even compressed in deflate or gzip form, unless the document is stored in an envelope format, such as a MIME message with headers, and the hyperlink renderer supports decoding this envelope format transparently.

For now, a hyperlink can specify the MIME type of the document with an attribute giving the "charset", i.e. the triplet <CCS, CEF, CES>, but there is no reliable, documented attribute to specify its TES (unless the document is transported via email or HTTP, and the source transforms it on the fly into the desired TES, a CPU-intensive job for servers that could be avoided if documents could be stored or cached directly by the server in their TES-encoded form; this requires support in the server's storage for keeping this out-of-band information).

Solutions do exist, but they are not universal or interoperable across distinct software working on the same physical document store: some filesystems offer that support with out-of-band metadata, some servers use private conventions with multiple file extensions and private server configuration files, and so on. If the TES encoding and decoding of a document could be handled directly by the client, without depending on the underlying transport or storage technology, things would be easier. TES encoding is really out of the scope of Unicode, but its support in the various applications using encoded text documents should be enhanced.
This includes support for it in the XML and HTML document syntax, notably within source hyperlinks.

As a final note: multiple TES encoding stages may be chained in any transport or storage, and changed on the fly across the nodes of a transport network, without affecting the charset used for the decoded document. But in many applications, including HTTP, only one TES can be specified (otherwise it would break other features such as document content signature and certification). I know of no working implementation of any transport protocol that transparently allows specifying these multiple TES encodings (most often these steps are possible only in distinct layers of the transport architecture, where they can be made transparent to the applications handling encoded documents in the upper layers). This means that TES encoding/decoding affects the performance (and reliability...) of each relaying node in a transport network (such as proxies), a caveat avoided by including the TES within a MIME charset, so that no TES encoding (or, more precisely, just an identity, do-nothing, "8-bit" TES encoding) is necessary in the relaying chain.

----- Original Message -----
From: "John Tisdale" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, August 18, 2004 5:27 AM
Subject: MSDN Article, Second Draft

Thanks everyone for your helpful feedback on the first draft of the MSDN article. I couldn't fit in all of the suggestions, as the Unicode portion is only a small piece of my article. The following is the second draft, based on the corrections, additional information and resources provided. Also, I would like to get feedback on the most accurate/appropriate term(s) for describing the CCS, CEF and CES (layers, levels, components, etc.). I am under a tight deadline and need to collect any final feedback rather quickly before producing the final version. Special thanks to Asmus for investing a lot of his time to help.