Maybe a fourth level of abstraction is needed to complete what the MIME
registry describes as "charsets": a TES (Transfer Encoding Syntax) sometimes
happens at the end, and some legacy CES specifications mix in what should
have been left to a separate TES.

For example, the specification of SCSU (Simple Compression Scheme for
Unicode) defines it as a way to convert a stream of code points directly to
a stream of bytes, without going through the level of abstraction of
intermediate "code units" (or rather, in this case, the code units are
simply the encoded bytes).
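
To illustrate (a sketch of my own, based on my reading of the SCSU
specification, not an example taken from it): in SCSU's initial single-byte
mode the printable ASCII bytes stand for themselves, so for plain ASCII
text the encoded bytes and the code units coincide:

    # In SCSU's initial state, "Hello" encodes to the same five bytes as
    # ASCII: here the byte stream *is* the code unit stream.
    scsu_bytes = bytes([0x48, 0x65, 0x6C, 0x6C, 0x6F])
    assert scsu_bytes == "Hello".encode("ascii")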

This makes SCSU a legal CEF (like UTF-32, UTF-16 and UTF-8), converting a
stream of encoded characters into a stream of (8-bit) code units, and a
legal CES (like UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8 or CESU-8, or
like UTF-16, UTF-32, UTF-8 or CESU-8 with a leading BOM), taking the
generated byte order into account.
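
The CES distinction is easy to see with Python's standard codecs (a small
sketch of mine, not part of any of these specifications):

    text = "A\u00E9"  # U+0041, U+00E9

    print(text.encode("utf-16-be"))  # b'\x00A\x00\xe9'  (big-endian, no BOM)
    print(text.encode("utf-16-le"))  # b'A\x00\xe9\x00'  (little-endian, no BOM)
    print(text.encode("utf-16"))     # leading BOM, platform byte order,
                                     # e.g. b'\xff\xfeA\x00\xe9\x00'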

But the SCSU specification speaks about "optional extensions", which are
probably badly named because they would be better described as TES
(DLE-escaping for NUL and DLE, run-length compression, or COBS encoding),
exactly like other well-known TES (Base64, Quoted-Printable) widely used in
MIME contexts.
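
Here is a rough sketch of such a DLE-escaping TES (my own illustration, not
the exact scheme defined by SCSU's optional extensions), next to the
standard Base64 TES:

    import base64

    DLE, NUL = 0x10, 0x00

    def dle_escape(data):
        # TES layer: make the byte stream transparent for NUL/DLE-sensitive
        # transports by prefixing each NUL or DLE byte with a DLE byte.
        out = bytearray()
        for b in data:
            if b in (NUL, DLE):
                out.append(DLE)
            out.append(b)
        return bytes(out)

    payload = b"\x00text\x10stream"
    print(dle_escape(payload))        # b'\x10\x00text\x10\x10stream'
    print(base64.b64encode(payload))  # b'AHRleHQQc3RyZWFt'

Note that neither layer knows or cares which charset produced the payload;
that is exactly what makes a TES a separate level of abstraction.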

I think there still exist other legacy charsets in the MIME registry that
mix these levels of abstraction, where a clear separation between the CES
and TES levels would have helped their interoperability. One cause of this
discrepancy is that it has long been easier to create a new charset and have
it registered in the already long MIME charset registry than to define a
clear TES separately (the TES registry in MIME is not very long, and support
for multiple TES in applications has often been weak and not easily
extensible, developers preferring to first build the support needed to
handle correctly the many possible CES, identified simply by their MIME
"charset" identifier).

The other related "problem" of TES is that many document structures
(including XML) only offer a place to specify the "charset" (i.e. the result
of combining a CCS, a CEF and a CES), but no place to specify the TES, which
is left, apparently, to the transport protocol; this ignores the case of
local storage, where reliable identification of the TES is nearly
impossible... This means that local stores cannot easily benefit from the
advantages of a TES specification (for example, when creating a reference to
a text document, it's impossible to specify in the link that this document
has been COBS-encoded or Base64-encoded, or even compressed in deflate or
gzip form, unless the local document is stored in an envelope format, such
as an RFC 2822 message with headers, and the hyperlink renderer supports
decoding this envelope format transparently).
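
A minimal sketch of the problem (my own illustration): once a TES is
applied, even the charset declaration inside the document becomes unreadable
without out-of-band knowledge of that TES:

    import zlib

    doc = '<?xml version="1.0" encoding="UTF-8"?><t>caf\u00E9</t>'
    stored = zlib.compress(doc.encode("utf-8"))  # TES applied at storage time

    # The XML declaration names the charset (CCS+CEF+CES), but nothing in
    # the stored bytes says "deflate": a reader must already know the TES
    # before it can even see that declaration.
    print(zlib.decompress(stored).decode("utf-8"))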

For now, a hyperlink can specify the MIME type of the document with an
attribute specifying the "charset", i.e. the triplet <CCS,CEF,CES>, but
there is no reliable and documented attribute to specify its TES (unless the
document is transported via email or HTTP, and the source does the job on
the fly of transforming it to the desired TES, which is a CPU-intensive job
for servers that could be avoided if documents could be stored or cached
directly by the server in their TES-encoded form; this requires support in
the server's storage to keep this out-of-band information).
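
For comparison, HTTP's own out-of-band negotiation is today's closest
mechanism (a sketch of mine, with a hypothetical host and path; the actual
headers returned will of course vary):

    import http.client

    conn = http.client.HTTPConnection("example.org")
    # Ask the server to apply a TES (here gzip) on the fly; the charset
    # stays in Content-Type while the TES travels in Content-Encoding.
    conn.request("GET", "/document.xml",
                 headers={"Accept-Encoding": "gzip"})
    resp = conn.getresponse()
    print(resp.getheader("Content-Type"))      # e.g. text/xml; charset=UTF-8
    print(resp.getheader("Content-Encoding"))  # e.g. gzip

But the hyperlink that pointed to the document still had no way to announce
this in advance.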

Solutions do exist, but they are not universal and interoperable across
distinct software working with the same physical document store: some
filesystems offer that support with out-of-band metadata, some servers use
private conventions with multiple file extensions and private server
configuration files...

If the document's TES encoding and decoding could be handled directly by the
client, without depending on the underlying transport or storage technology,
it would be easier.

TES encoding is really out of the scope of Unicode, but its support in the
various applications using encoded text documents should be enhanced. This
includes support for it in the XML and HTML document syntaxes, notably
within source hyperlinks.

As a final note: multiple TES encoding stages may be chained in any
transport or storage, and changed on the fly across nodes in a transport
network, without affecting the charset used for the decoded document. But in
many applications, including HTTP, only one TES can be specified (otherwise
it would break other features such as document content signature and
certification). I know of no working implementation of any transport
protocol that transparently allows specifying these multiple TES encodings
(most often these steps are possible only in distinct layers of the
transport architecture, where they can be made transparent to the
applications handling encoded documents in the upper layers). This means
that TES encoding/decoding affects the performance (and reliability...) of
each relaying node in a transport network (such as proxies), a caveat
avoided by including the TES within a MIME charset, so that no TES encoding
(or more precisely, just an identity, do-nothing, "8-bit" TES encoding) is
necessary in the relaying chain...
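
As a sketch of such chaining (my own, with a hypothetical two-stage chain:
deflate, then Base64), each relaying node must peel the stages off in
reverse order:

    import base64, zlib

    def tes_chain(data):
        # Stage 1: compress (deflate); stage 2: make the result 7-bit safe.
        return base64.b64encode(zlib.compress(data))

    def tes_unchain(data):
        # Decoding must undo the stages in the reverse order of encoding.
        return zlib.decompress(base64.b64decode(data))

    doc = "the charset of this text is unaffected".encode("utf-8")
    assert tes_unchain(tes_chain(doc)) == doc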

----- Original Message ----- 
From: "John Tisdale" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, August 18, 2004 5:27 AM
Subject: MSDN Article, Second Draft


Thanks everyone for your helpful feedback on the first draft of the MSDN
article. I couldn't fit in all of the suggestions as the Unicode portion is
only a small piece of my article. The following is the second draft based on
the corrections, additional information and resources provided.

Also, I would like to get feedback on the most accurate/appropriate term/s
for describing the CCS, CEF and CES (layers, levels, components, etc.)?

I am under a tight deadline and need to collect any final feedback rather
quickly before producing the final version.

Special thanks to Asmus for investing a lot of his time to help.

