Quoting from WA1 draft section 2.2.5.1. Specifying and establishing the document's character encoding:

The meta element may also be used, in HTML only (not in XHTML) to provide UAs with character encoding information for the file. To do this, the meta element must be the first element in the head element,

To cater for implementations that consume the byte stream only once in all cases and do not rewind the input and restart the parser upon discovering the meta, I think it would be beneficial to additionally stipulate that:

1. The meta element-based character encoding declaration is expected to work only if the Basic Latin range of characters maps to the same bytes as in the US-ASCII encoding.
2. If there is neither external character encoding information nor a BOM (see below), there MUST NOT be any non-ASCII bytes in the document byte stream before the end of the meta element that declares the character encoding. (In practice this would ban unescaped non-ASCII class names on the html and body elements and non-ASCII comments at the beginning of the document.)
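To illustrate the second stipulation, here is a rough sketch of the check a conformance checker might run on the raw bytes. The function name and the regular expression are my own illustration, not anything from the draft, and the pattern only approximates "the meta element that declares the character encoding" by looking for the first meta tag that mentions charset.

```python
import re

# Hypothetical approximation: the first <meta ...> tag whose attributes mention "charset".
META_RE = re.compile(rb"<meta\b[^>]*charset[^>]*>", re.IGNORECASE)


def prefix_is_ascii_up_to_meta(data: bytes) -> bool:
    """Return True if every byte up to the end of the first
    charset-declaring meta tag is ASCII. Return False if no such
    tag is found or if a non-ASCII byte precedes its end."""
    match = META_RE.search(data)
    if match is None:
        return False
    return all(byte < 0x80 for byte in data[:match.end()])
```

A document with a non-ASCII comment before the meta would fail this check, while non-ASCII bytes after the meta are not examined at all.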

it must have the http-equiv attribute set to the literal value Content-Type,

I think case-insensitivity should be allowed in the string "Content-Type", because there is legacy precedent for that and HTTP defines header names as case-insensitive.

and must have the content attribute set to the literal value text/html; charset=

That string should be case-insensitive as well, because HTTP defines it as case-insensitive. Also, should zero or more white space characters be allowed before ';' and around '=', and should the space after ';' be one or more white space characters? HTTP-wise yes, but would it lead to real-world incompatibilities? (I have not tested.)
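For concreteness, a lenient matcher along these lines could look roughly like the sketch below. The pattern and the function name are assumptions of mine, and the sketch takes the most permissive answer to the questions above (any amount of white space, including none, around ';' and '='), which is exactly the part I have not tested against real-world UAs.

```python
import re

# Assumed lenient syntax: case-insensitive, optional white space around ';' and '='.
CONTENT_RE = re.compile(
    r"\s*text/html\s*;\s*charset\s*=\s*([^\s;\"']+)\s*",
    re.IGNORECASE,
)


def charset_from_content(value: str):
    """Return the encoding name from a meta content attribute value,
    or None if the value does not have the expected shape."""
    match = CONTENT_RE.fullmatch(value)
    return match.group(1) if match else None


def is_content_type(http_equiv: str) -> bool:
    """Compare the http-equiv value case-insensitively, as HTTP does for header names."""
    return http_equiv.strip().lower() == "content-type"
```

The returned name would still need to be checked against the IANA registry; the sketch does not do that.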

immediately followed by the character encoding, which must be a valid character encoding name. [IANACHARSET] When the meta element is used in this way, there must be no other attributes set on the element. Other than for giving the document's character encoding in this way, the http-equiv attribute must not be used.

In XHTML, the XML declaration should be used for inline character encoding information.

Excellent.

Authors should avoid including inline character encoding information. Character encoding information should instead be included at the transport level (e.g. using the HTTP Content-Type header).

I disagree.

For HTML in contemporary UAs, there is no real harm in including the character encoding information both on the HTTP level and in the meta, as long as the two do not contradict each other. On the contrary, the author-provided internal information is actually useful when end users save pages to disk using UAs that do not reserialize with internal character encoding information.

With XML, there is a robust method for identifying the character encoding internally. When the encoding is explicit, the sniffing is also interoperably implemented. (Unfortunately, for the BOMless implicit case, see http://bugzilla.opendarwin.org/show_bug.cgi?id=3809 . Gecko used to have the same bug.) RFC 3023's insistence on declaring the encoding authoritatively outside the XML byte stream itself is, in my opinion, as silly as insisting on declaring the compression method of a zip archive authoritatively on the HTTP level instead of using the information stored in the file.

The TAG has found "Thus there is no ambiguity when the charset is omitted, and the STRONGLY RECOMMENDED injunction [of RFC 3023] to use the charset is misplaced for application/xml and for non-text "+xml" types." (http://www.w3.org/2001/tag/2004/0430-mime.html#char-encoding).

For HTML, user agents must use the following algorithm in determining the character encoding of a document:
1. If the transport layer specifies an encoding, use that.

Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; UTF-32 makes no practical sense for interchange on the Web.)
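A sketch of what such a step might look like, limited to UTF-8 and UTF-16 as suggested. The function name and the returned labels are my own choices for illustration.

```python
import codecs


def sniff_bom(data: bytes):
    """Return an encoding label if the byte stream starts with a
    UTF-8 or UTF-16 BOM, otherwise None."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        # UTF-32 is deliberately not considered, so a UTF-32LE BOM
        # (FF FE 00 00) would also end up here.
        return "utf-16-le"
    return None
```

If the function returns None, the UA would proceed to the meta-based step.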

2. Otherwise, if the user agent can find a meta element that specifies character encoding information (as described above), then use that.

If a conformance checker has not determined the character encoding by now, what should it do? Should it report the document as non-conforming (my preferred choice)? Should it default to US-ASCII and report any non-ASCII bytes as conformance errors? Should it continue to the fuzzier steps like browsers would (hopefully not)?

3. Otherwise, if the user agent can autodetect the character encoding from applying frequency analysis or other algorithms to the data stream, then use that.
4. Otherwise, use an implementation-defined or user-specified default character encoding (ISO-8859-1, windows-1252, and UTF-8 are recommended as defaults, and can in many cases be identified by inspection as they have different ranges of valid bytes).

I think it does not make sense to recommend ISO-8859-1, because windows-1252 is always a better guess in practice: the two differ only in the 0x80-0x9F range, where ISO-8859-1 has control characters and windows-1252 has printable ones. In the context of HTML, UTF-8 looks like a weird default considering years of precedent with the de facto windows-1252 default. (Of course, if the UA is willing to examine the entire byte stream before parsing, UTF-8 can be detected very reliably.)
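The parenthetical can be sketched as follows, assuming the UA has the whole byte stream in hand before parsing. The function name is hypothetical, and treating an ASCII-only stream as windows-1252 is my own choice here: such a stream decodes identically either way, so it offers no evidence for UTF-8.

```python
def guess_fallback_encoding(data: bytes) -> str:
    """Return "utf-8" if the stream contains non-ASCII bytes and is
    valid UTF-8 as a whole, otherwise the de facto default
    "windows-1252"."""
    if data.isascii():
        # No non-ASCII bytes, so nothing distinguishes the candidates.
        return "windows-1252"
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "windows-1252"
    return "utf-8"
```

The detection is reliable because text in a legacy single-byte encoding rarely happens to form valid UTF-8 multi-byte sequences throughout an entire document.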

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/
