Henri Sivonen on 2006-03-14:
> It appears that the INVARIANT charset is not designed to be invariant
> under different Web-relevant encodings (e.g. stateful Asian encodings that
> use ESC, and VISCII, which assigns printable characters to the control
> range). Rather, the INVARIANT charset seems to be designed to be invariant
> under the various national variants of ISO-646, which used to be relevant
> to email until ten years ago but luckily have never been relevant to the
> Web.
True, but it is still relevant as a subset that is identical in most
encodings. If you drop the "+" from it, you can include UTF-7 in the covered
encodings as well, and I don't know of any IANA encoding labels with "+" in
them, meaning it can be used for <meta> discovery.
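The "+" caveat can be checked mechanically. As a hedged sketch (the label and codec list are my own choices, not from the thread), Python's codecs show that a typical IANA label round-trips byte-for-byte under common encodings, while a literal "+" does not survive UTF-7 unchanged:

```python
# Sketch (my own illustration): every character of a typical IANA label
# encodes as itself under these codecs, but a literal "+" is the
# exception under UTF-7, where it opens a modified-base64 escape
# sequence and is written as "+-".
label = "iso-8859-1"
for codec in ("ascii", "utf-8", "iso-8859-1", "utf-7"):
    assert label.encode(codec) == b"iso-8859-1"

print("+".encode("utf-7"))  # b'+-'
```

So a scanner doing <meta> discovery on raw bytes stays safe as long as no encoding label it must recognize contains "+".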
> (BTW, how Web-relevant is VISCII, really?)
I can't remember having seen any web pages use it.
>> Transcoding is very popular, especially in Russia.
> In *proxies* *today*? What's the point, considering that browsers have
> supported the Cyrillic encoding soup *and* UTF-8 for years?
mod_charset is not a proxy; it operates at the server level. A few years
back, browsers that did not support many character encodings were still
popular (according to the statistics I have seen), but that has likely
changed lately. mod_charset remains in use, however.
> How could proxies properly transcode form submissions coming back without
> messing everything up spectacularly?
That's why the "hidden-string" technique was invented. Introduce a hidden
<input> with a character string that will get encoded differently depending
on the encoding used. When data comes in, use this character string to
determine what encoding was used.
> I am aware of the Russian Apache project. A glance at the English docs
> suggests it is not reading the meta.
I haven't read the documentation, but I have seen pages being served in
different character encodings in different browsers by Russian Apache
servers, with the <meta> intact and indicating the original encoding. It is
quite possible that the <meta> wasn't used anywhere.
> Not a fatal problem if the information on the HTTP layer is right (until
> saving to disk, that is).
Exactly.
> Easy parse errors are not fatal in browsers. Surely it is OK for a
> conformance checker to complain that much at server operators whose HTTP
> layer and meta do not match.
I just reacted to the notion of calling such documents invalid. It is the
transport layer that defines the encoding; whatever the document says or how
it looks is irrelevant, and is just something you can look at if the
transport layer neglects to say anything.
> Is BOCU-1 so much smaller than UTF-8 with deflate compression on the HTTP
> layer that the gratuitous incompatibility could ever be justified?
I don't know, I haven't compared (but you should of course compare BOCU-1
with deflate if you do).
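For the UTF-8 side of that comparison, a rough sketch is straightforward (the sample text is my own; a real measurement would also need a BOCU-1 encoder, e.g. from ICU, which Python's standard library does not ship):

```python
import zlib

# One side of the comparison suggested above: UTF-8 plus deflate on the
# HTTP layer. Cyrillic text costs two bytes per letter in UTF-8, but
# deflate recovers most of that on realistic, repetitive prose.
text = "Пример русского текста для сравнения размеров. " * 50
raw = text.encode("utf-8")
deflated = zlib.compress(raw, 9)
print(len(raw), len(deflated))
```

The interesting number is how far deflated UTF-8 is from deflated BOCU-1 on the same corpus; that gap is what would have to justify the incompatibility.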
--
\\//
Peter, software engineer, Opera Software
The opinions expressed are my own, and not those of my employer.
Please reply only by follow-ups on the mailing list.