Henri Sivonen on 2006-03-11:

I think it would be beneficial to additionally stipulate that
1. The meta element-based character encoding information declaration is expected to work only if the Basic Latin range of characters maps to the same bytes as in the US-ASCII encoding.
Is this realistic? I'm not really familiar enough with character encodings to say if this is what happens in general.
I suppose it is realistic. See below.

Yes, for most encodings, the US-ASCII range is the same, and if you restrict it a bit further (the "INVARIANT" charset in RFC 1345), it covers most of the ambiguous encodings. The others can be easily detected as they usually have very different bit patterns (EBCDIC) or word lengths (UTF-16, UTF-32).
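
To make this concrete, here is a minimal Python sketch of such a check (the exact range tested and the function name are my own choices, not anything from the thread); EBCDIC and UTF-16 fail it, while the typical ambiguous 8-bit encodings pass:

def is_ascii_compatible(label):
    # True if U+0020..U+007E encode to the same single bytes as in US-ASCII.
    try:
        for cp in range(0x20, 0x7F):
            if chr(cp).encode(label) != bytes([cp]):
                return False
    except (LookupError, UnicodeEncodeError):
        return False
    return True

print(is_ascii_compatible("iso-8859-5"))  # True
print(is_ascii_compatible("cp500"))       # False (EBCDIC)
print(is_ascii_compatible("utf-16"))      # False (two bytes per character, plus a BOM)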

2. If there is no external character encoding information nor a BOM (see below), there MUST NOT be any non-ASCII bytes in the document byte stream before the end of the meta element that declares the character encoding. (In practice this would ban unescaped non-ASCII class names on the html and head elements and non-ASCII comments at the beginning of the document.)
Again, can we realistically require this? I need to do some studies of non-latin pages, I guess.
As UA behavior, no. As a conformance requirement, maybe.

If you require browsers to switch on-the-fly, they can redo the decoding when they find the <meta> anyway, and this is no longer a problem. There are a lot of documents with non-ASCII-language comments and <title> tags that are positioned before the <meta>.
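
For illustration, a rough Python sketch of that switch-on-the-fly behaviour (my own simplification; the prescan regex and the windows-1252 fallback are assumptions): decode with a fallback so parsing can start immediately, and if a <meta>-declared charset turns up later, throw the work away and re-decode from the first byte. Non-ASCII comments and <title> text before the <meta> then cost a restart, but nothing breaks:

import re

META = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)

def decode_with_restart(raw, fallback="windows-1252"):
    # First pass: decode speculatively with the fallback.
    text = raw.decode(fallback, errors="replace")
    # If a <meta> declares a charset anywhere, redo the decode from byte 0.
    m = META.search(raw)
    if m:
        label = m.group(1).decode("ascii")
        try:
            return raw.decode(label, errors="replace")
        except LookupError:
            pass  # unknown label: keep the speculative result
    return text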

Authors should avoid including inline character encoding information. Character encoding information should instead be included at the transport level (e.g. using the HTTP Content-Type header).
I disagree.
With HTML and contemporary UAs, there is no real harm in including the character encoding information both at the HTTP level and in the meta as long as the two are not contradictory. On the contrary, the author-provided internal information is actually useful when end users save pages to disk using UAs that do not reserialize with internal character encoding information.
...and it breaks everything when you have a transcoding proxy, or similar.
Well, not until you save to disk, since HTTP takes precedence. However, authors can escape this by using UTF-8. (Assuming here that tampering with UTF-8 would be harmful, wrong and pointless.)

Interestingly, transcoding proxies tend to be brought up by residents of Western Europe, North America or the Commonwealth. I have never seen a Russian person living in Russia or a Japanese person living in Japan bring up transcoding proxies in any online or offline discussion. That's why I doubt the importance of transcoding proxies.

Transcoding is very popular, especially in Russia. With mod_charset in Apache, it will (AFAICT) use the information in the <meta> of the document to determine the source encoding and then transcode it to an encoding it believes the client can handle (based on browser sniffing). It transcodes on a byte level, so the <meta> remains unchanged, but it is overridden by the HTTP header.
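
Roughly, the behaviour described above could be mimicked like this (a toy Python sketch with names and defaults of my own choosing; the real mod_charset works inside Apache, not like this):

import re

META = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)

def transcode_for_client(raw, client_encoding="koi8-r"):
    # Determine the source encoding from the document's own <meta>.
    m = META.search(raw)
    source = m.group(1).decode("ascii") if m else "utf-8"
    # Transcode the bytes for the client; the <meta> text still *names* the
    # old encoding, but the outgoing HTTP header overrides it.
    body = raw.decode(source, errors="replace").encode(client_encoding, errors="replace")
    headers = {"Content-Type": "text/html; charset=" + client_encoding}
    return headers, body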

The <meta> tag is really information for the server; it is the server that is *supposed* to read it and put the data into the HTTP header. Unfortunately, not many servers support that, leaving us with having to parse it in the browsers instead. Reading the <meta> tag for encoding information is basically at the same level as guessing the encoding by frequency analysis: the server didn't say anything, so perhaps you can get lucky.
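
And for completeness, the "server reads the <meta> and fills in the header" idea is about this much code (a minimal sketch; the prescan size and the utf-8 default are placeholders of mine):

import re

META = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)

def content_type_for(path):
    # Prescan the start of the file for a <meta>-declared charset and
    # reflect it in the Content-Type header the server would send.
    with open(path, "rb") as f:
        head = f.read(1024)
    m = META.search(head)
    charset = m.group(1).decode("ascii") if m else "utf-8"
    return "text/html; charset=" + charset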

Character encoding information shouldn't be duplicated, IMHO, that's just asking for trouble.
I suggest a mismatch be considered an easy parse error and, therefore, reportable.

That will not work in the mod_charset case above.

For HTML, user agents must use the following algorithm in determining the
character encoding of a document:
1. If the transport layer specifies an encoding, use that.
Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; UTF-32 makes no practical sense for interchange on the Web.)
I don't know, should there?
I believe there should.

BOM-sniffing should be done *after* looking at the transport layer's information. It might know something you don't. It's a part of the "guessing-the-content" step.
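
Put together, the ordering being argued for here would look roughly like this (a Python sketch under my own assumptions about the prescan length and the fallback; UTF-32 BOMs are left out on purpose, per the parenthetical above):

import re

BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]
META = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)

def sniff_encoding(raw, transport_charset=None, fallback="windows-1252"):
    if transport_charset:                  # 1. the transport layer might know something you don't
        return transport_charset
    for bom, name in BOMS:                 # 2. BOM sniffing
        if raw.startswith(bom):
            return name
    m = META.search(raw[:1024])            # 3. <meta> prescan of the ASCII-compatible prefix
    if m:
        return m.group(1).decode("ascii")
    return fallback                        # 4. last-resort guess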

Requirements I'd like to see:

Documents must specify a character encoding and must use an IANA-registered encoding and must identify it using its preferred MIME name or use a BOM (with UTF-8, UTF-16 or UTF-32). UAs must recognize the preferred MIME name of every encoding they support that has a preferred MIME name. UAs should recognize IANA-registered aliases.

That could be useful; the only problem is that the IANA registry of encoding labels is a bit difficult to read when you are trying to figure out which name to use.
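
For example, once the preferred names and aliases have been pulled out of the registry, resolving an author-supplied label is just a table lookup (the entries below are a hand-picked sample of IANA aliases, not the full list):

PREFERRED_MIME_NAME = {
    # lower-cased alias -> preferred MIME name
    "latin1": "ISO-8859-1",
    "iso_8859-1": "ISO-8859-1",
    "csisolatin1": "ISO-8859-1",
    "ms_kanji": "Shift_JIS",
    "csshiftjis": "Shift_JIS",
}

def preferred_name(label):
    label = label.strip().lower()
    return PREFERRED_MIME_NAME.get(label, label)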

Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE (i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the EBCDIC family of encodings. Documents using the UTF-16 or UTF-32 encodings must have a BOM.

I don't think forbidding BOCU-1 is a good idea. If a proper specification of it is ever written, it could be very useful as a compression format for documents.

Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)

Yes, especially since encoding definitions tend to change over time.
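
For what it's worth, emitting U+FFFD for bogus bytes is exactly what e.g. Python's "replace" error handler does on decode:

bogus = b"caf\xe9 \xff\xfe au lait"   # not valid UTF-8
print(bogus.decode("utf-8", errors="replace"))
# -> caf\ufffd \ufffd\ufffd au lait (each bogus byte becomes one U+FFFD)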

Authors are advised to use the UTF-8 encoding. Authors are advised not to use the UTF-32 encoding or legacy encodings. (Note: I think UTF-32 on the Web is harmful and utterly pointless, but Firefox and Opera support it.)

UTF-32 can be useful as an internal format, but I agree that it's not very useful on the web. Not sure about the "harmful" bit, though.

--
\\//
Peter, software engineer, Opera Software

 The opinions expressed are my own, and not those of my employer.
 Please reply only by follow-ups on the mailing list.
