[whatwg] Internal character encoding declaration

Henri Sivonen Mon, 08 Aug 2005 11:42:56 -0700

Quoting from WA1 draft section 2.2.5.1. Specifying and establishing the document's character encoding:

The meta element may also be used, in HTML only (not in XHTML) to provide UAs with character encoding information for the file. To do this, the meta element must be the first element in the head element,

To cater for implementations that consume the byte stream only once in all cases and do not rewind the input and restart the parser upon discovering the meta, I think it would be beneficial to additionally stipulate that

1. The meta element-based character encoding information declaration is expected to work only if the Basic Latin range of characters maps to the same bytes as in the US-ASCII encoding.
2. If there is no external character encoding information nor a BOM (see below), there MUST NOT be any non-ASCII bytes in the document byte stream before the end of the meta element that declares the character encoding. (In practice this would ban unescaped non-ASCII class names on the html and body elements and non-ASCII comments at the beginning of the document.)
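
As a rough illustration of point 2, a checker could verify that every byte up to the end of the declaring meta tag is ASCII. This is only a sketch: the function name and the regex-based tag search are assumptions made for the example (a real implementation would tokenize properly), and it treats the first meta tag as the declaring one, which matches the draft's requirement that it be the first element in head.

```python
import re

# Hypothetical helper pattern: the first <meta ...> tag in the raw bytes.
# A regex stands in for a real tokenizer here, purely for illustration.
_META_TAG = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)


def ascii_only_before_meta(data: bytes) -> bool:
    """Return True if every byte up to the end of the first meta tag is ASCII."""
    match = _META_TAG.search(data)
    if match is None:
        # No meta tag found, so there is no internal declaration to protect.
        return False
    prefix = data[: match.end()]
    return all(byte < 0x80 for byte in prefix)
```

Non-ASCII bytes after the meta element (for example in the title) do not affect the result; a non-ASCII comment before it does.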

it must have the http-equiv attribute set to the literal value Content-Type,

I think case-insensitivity should be allowed in the string "Content-Type", because there is legacy precedent for that and HTTP defines header names as case-insensitive.

and must have the content attribute set to the literal value text/html; charset=

That string should be case-insensitive as well, because HTTP defines it case-insensitive. Also, should zero or more white space characters be allowed before ';' and around '=' and should the space after ';' be one or more white space characters? HTTP-wise yes, but would it lead to real-world incompatibilities? (I have not tested.)
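
To make the lenient reading concrete, here is a minimal sketch of a matcher that is case-insensitive and tolerates optional white space before ';' and around '='. It shows the HTTP-style interpretation asked about above, not what the draft currently requires; the function name and the exact pattern are assumptions for the example, and the result is not checked against the IANA registry.

```python
import re
from typing import Optional

# Case-insensitive match of "text/html; charset=<name>" with optional
# white space before ';', after ';', and around '='.
_CONTENT = re.compile(
    r"^\s*text/html\s*;\s*charset\s*=\s*([^\s;\"']+)\s*$",
    re.IGNORECASE,
)


def charset_from_content(content: str) -> Optional[str]:
    """Return the lower-cased charset name, or None if the value does not match."""
    match = _CONTENT.match(content)
    return match.group(1).lower() if match else None
```

A strict reading of the draft would instead compare against the literal prefix "text/html; charset=" and reject everything else.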

immediately followed by the character encoding, which must be a valid character encoding name. [IANACHARSET] When the meta element is used in this way, there must be no other attributes set on the element. Other than for giving the document's character encoding in this way, the http-equiv attribute must not be used.
In XHTML, the XML declaration should be used for inline character encoding information.


Excellent.

Authors should avoid including inline character encoding information. Character encoding information should instead be included at the transport level (e.g. using the HTTP Content-Type header).


I disagree.

With HTML with contemporary UAs, there is no real harm in including the character encoding information both on the HTTP level and in the meta as long as the information is not contradictory. On the contrary, the author-provided internal information is actually useful when end users save pages to disk using UAs that do not reserialize with internal character encoding information.

With XML, there is a robust method for identifying the character encoding internally. When the encoding is explicit, the sniffing is also interoperably implemented. (Unfortunately, for the BOMless implicit case, see http://bugzilla.opendarwin.org/show_bug.cgi?id=3809. Gecko used to have the same bug.) RFC 3023's insistence on declaring the encoding authoritatively outside the XML byte stream itself is, in my opinion, as silly as insisting on declaring the compression method of a zip archive authoritatively on the HTTP level instead of using the information stored in the file.

The TAG has found "Thus there is no ambiguity when the charset is omitted, and the STRONGLY RECOMMENDED injunction [of RFC 3023] to use the charset is misplaced for application/xml and for non-text "+xml" types." (http://www.w3.org/2001/tag/2004/0430-mime.html#char-encoding).

For HTML, user agents must use the following algorithm in determining the character encoding of a document:
1. If the transport layer specifies an encoding, use that.

Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; UTF-32 makes no practical sense for interchange on the Web.)
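
A BOM-sniffing step of the kind suggested here could look roughly like the following sketch. The function name and the returned labels are assumptions for the example; only UTF-8 and UTF-16 are recognized, in line with the remark about UTF-32.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding label if the stream starts with a UTF-8 or UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    # UTF-32 is deliberately not considered, so a UTF-32LE BOM
    # (FF FE 00 00) would be reported as UTF-16LE by this sketch.
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None
```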

2. Otherwise, if the user agent can find a meta element that specifies character encoding information (as described above), then use that.

If a conformance checker has not determined the character encoding by now, what should it do? Should it report the document as non-conforming (my preferred choice)? Should it default to US-ASCII and report any non-ASCII bytes as conformance errors? Should it continue to the fuzzier steps like browsers would (hopefully not)?

3. Otherwise, if the user agent can autodetect the character encoding from applying frequency analysis or other algorithms to the data stream, then use that.
4. Otherwise, use an implementation-defined or user-specified default character encoding (ISO-8859-1, windows-1252, and UTF-8 are recommended as defaults, and can in many cases be identified by inspection as they have different ranges of valid bytes).

I think it does not make sense to recommend ISO-8859-1, because windows-1252 is always a better guess in practice. In the context of HTML, UTF-8 looks like a weird default considering years of precedent with the de facto windows-1252 default. (Of course, if the UA is willing to examine the entire byte stream before parsing, UTF-8 can be detected very reliably.)
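
The parenthetical above can be sketched as follows, assuming the UA has the whole byte stream in hand: try a strict UTF-8 decode and fall back to windows-1252 if it fails. The function name is hypothetical, and this is only one way to combine the two guesses, not something the draft specifies.

```python
def guess_fallback_encoding(data: bytes) -> str:
    """Guess UTF-8 if the entire stream is valid UTF-8, else windows-1252."""
    try:
        # Strict decoding rejects malformed or truncated sequences, which
        # text in a legacy 8-bit encoding almost always contains.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "windows-1252"
    # Pure ASCII input also lands here; the two encodings agree on ASCII.
    return "utf-8"
```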


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/
