Henri Sivonen on 2006-03-14:
> It appears that the INVARIANT charset is not designed to be invariant
> under different Web-relevant encodings (e.g. stateful Asian encodings that
> use ESC, and VISCII, which assigns printable characters to the control
> range). Rather, the INVARIANT charset seems to be designed to be invariant
> under the various national variants of ISO-646, which used to be relevant
> to email until ten years ago but luckily have never been relevant to the
> Web.
True, but it is still relevant as a subset that is identical in most
encodings. If you drop the "+" from it, you can include UTF-7 in the covered
encodings as well, and I don't know of any IANA encoding labels with "+" in
them, meaning it can be used for <meta> discovery.
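The "+" caveat can be checked mechanically. As a hedged sketch (the label and codec list are my own choices, not from the thread), Python's codecs show that a typical IANA label round-trips byte-for-byte under common encodings, while a literal "+" does not survive UTF-7 unchanged:

```python
# Sketch (my own illustration): every character of a typical IANA label
# encodes as itself under these codecs, but a literal "+" is the
# exception under UTF-7, where it opens a modified-base64 escape
# sequence and is written as "+-".
label = "iso-8859-1"
for codec in ("ascii", "utf-8", "iso-8859-1", "utf-7"):
    assert label.encode(codec) == b"iso-8859-1"

print("+".encode("utf-7"))  # b'+-'
```

So a scanner doing <meta> discovery on raw bytes stays safe as long as no encoding label it must recognize contains "+".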
> (BTW, how Web-relevant is VISCII, really?)
I can't remember having seen any web pages use it.
>> Transcoding is very popular, especially in Russia.
> In *proxies* *today*? What's the point, considering that browsers have
> supported the Cyrillic encoding soup *and* UTF-8 for years?
mod_charset is not a proxy; it operates at the server level. A few years
back, browsers that did not support many character encodings were still
popular (according to the statistics I have seen), but that has likely
changed lately. mod_charset remains in use, however.
> How could proxies properly transcode form submissions coming back without
> messing everything up spectacularly?
That's why the "hidden-string" technique was invented. Introduce a hidden
<input> with a character string that will get encoded differently depending
on the encoding used. When data comes in, use this character string to
determine what encoding was used.
> I am aware of the Russian Apache project. A glance at the English docs
> suggests it is not reading the meta.
I haven't read the documentation, but I have seen pages being served in
different character encodings in different browsers by Russian Apache
servers, with the <meta> intact and indicating the original encoding. It is
quite possible that the <meta> wasn't used anywhere.
> Not a fatal problem if the information on the HTTP layer is right (until
> saving to disk, that is).
Exactly.
> Easy parse errors are not fatal in browsers. Surely it is OK for a
> conformance checker to complain that much at server operators whose HTTP
> layer and meta do not match.
I just reacted to the notion of calling such documents invalid. It is the
transport layer that defines the encoding; whatever the document says or how
it looks is irrelevant, and is just something you can look at if the
transport layer neglects to say anything.
> Is BOCU-1 so much smaller than UTF-8 with deflate compression on the HTTP
> layer that the gratuitous incompatibility could ever be justified?
I don't know, I haven't compared (but you should of course compare BOCU-1
with deflate if you do).
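For the UTF-8 side of that comparison, a rough sketch is straightforward (the sample text is my own; a real measurement would also need a BOCU-1 encoder, e.g. from ICU, which Python's standard library does not ship):

```python
import zlib

# One side of the comparison suggested above: UTF-8 plus deflate on the
# HTTP layer. Cyrillic text costs two bytes per letter in UTF-8, but
# deflate recovers most of that on realistic, repetitive prose.
text = "Пример русского текста для сравнения размеров. " * 50
raw = text.encode("utf-8")
deflated = zlib.compress(raw, 9)
print(len(raw), len(deflated))
```

The interesting number is how far deflated UTF-8 is from deflated BOCU-1 on the same corpus; that gap is what would have to justify the incompatibility.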
--
\\//
Peter, software engineer, Opera Software
The opinions expressed are my own, and not those of my employer.
Please reply only by follow-ups on the mailing list.