Test files - bug 12897

Leif Halvard Silli Mon, 06 Jun 2011 14:43:34 -0700

See: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897


The purpose of this message is to publish two (for all practical purposes) 
identical test files, so that they can be consumed as follows: the xml 
file as XHTML with the Content-Type 'application/xhtml+xml; 
charset=ISO-8859-1', and the html file as HTML with the 
Content-Type 'text/html; charset=ISO-8859-1'.

* Both files have the Byte Order Mark (unless the mailing list software 
strips it).
* Both files are UTF-8 encoded.
* Which encoding the web server presents them as depends on how Apache 
is configured, but hopefully it will pick up the suffix '.iso8859-1' 
and thus serve them as ISO-8859-1 encoded.
* Likewise, Apache hopefully picks up the last file suffix, which is 
.xhtml and .html respectively.
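
As a rough illustration of the first two points, the following Python sketch builds the bytes of such a file in memory: the UTF-8 Byte Order Mark followed by UTF-8 encoded markup. The function name build_test_bytes and the sample markup are hypothetical and are not taken from the attached files.

```python
import codecs


def build_test_bytes(markup: str) -> bytes:
    """Return UTF-8 encoded markup prefixed with the UTF-8 BOM (EF BB BF)."""
    return codecs.BOM_UTF8 + markup.encode("utf-8")


# Hypothetical sample content, not the attached test files.
sample = build_test_bytes("<!DOCTYPE html><title>æøå</title>")
print(sample[:3])
```

Whether a server then labels these bytes as ISO-8859-1 is a matter of server configuration and is not covered by this sketch.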

xml.html.iso8859-1.xhtml
Description: application/xhtml

Title: UTF-8 encoded HTML document with the BOM + erroneous external encoding

ï»¿

HTML test: UTF-8 encoded document with erroneous external encoding

Test 1: Character gibberish:: Ã¦Ã¸Ã¥ ÃÃÃ Ã¶Ã¼Ã¿ ÃÃÅ¸ ÐÐÐ Ð°Ð±Ð² Ð¯Ð®Ð ÑÑÐ¶
Test 2: CSS box model error: If this document is interpreted as HTML, then in Firefox and Opera you can see the effect of quirks mode on these two elements:
Reference: this element is always 100 pixels wide.

The width attribute for this element is lacking unit information. In no-quirks mode, it will thus fill the entire width of the screen. Otherwise, it will be 100 pixels wide.

This HTML-compatible XHTML document is encoded as UTF-8 and also carries a character encoding signature in the form of a Byte Order Mark (BOM). In contrast, however, the HTTP Content-Type: header coming from the Web server claims (such is at least the plan ...) that the encoding of this document is ISO-8859-1.

For situations where two layers specify different encodings, XML 1.0 appendix F.2 recommends:

In the interests of interoperability, however, the following rule is recommended.

If an XML entity is in a file, the Byte-Order Mark and encoding declaration are used (if present) to determine the character encoding.

For HTML, at least Internet Explorer 8 and WebKit (Safari, Chrome) behave as recommended for XML 1.0: they respect the BOM more than they respect the HTTP Content-Type: header. They also respect the BOM more than a user's possible attempt to override the encoding, and for WebKit this goes for both XML and HTML. (I have not tested Internet Explorer version 9.)
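
The precedence described here (BOM first, HTTP charset second) can be sketched in Python as follows. This is a simplified illustration only: the function name choose_encoding is hypothetical, it checks for the UTF-8 BOM alone, and it does not reproduce the actual detection code of any browser.

```python
import codecs
from typing import Optional


def choose_encoding(body: bytes, http_charset: Optional[str]) -> Optional[str]:
    """Pick an encoding, letting a UTF-8 BOM win over the HTTP charset.

    Assumption: only the UTF-8 BOM is checked; UTF-16 BOMs and in-document
    declarations are left out to keep the sketch short.
    """
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    return http_charset


# The BOM overrides the (erroneous) external label.
with_bom = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
print(choose_encoding(with_bom, "ISO-8859-1"))

# Without a BOM, the HTTP charset is used as given.
print(choose_encoding(b"<!DOCTYPE html>", "ISO-8859-1"))
```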

For XML, Opera and Firefox do not respect the BOM as much as the XML specification recommends. As a consequence, when faced with an XML document with erroneous encoding information in the HTTP Content-Type: header, Firefox and Opera fire a draconian error message. For instance, this document has an HTTP Content-Type: header which says "ISO-8859-1", which, when this label is respected, leads the parser to see some illegal characters before the DOCTYPE. In contrast, WebKit browsers, which respect the XML recommendation, do not display any draconian error message.

For HTML, again, the misinterpretation by Opera and Firefox leads them to see 3 illegal characters before the DOCTYPE, which in turn sends them into quirks mode. This is an important reason why user interaction and HTTP should be ignored whenever there is a BOM.
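
Where the 3 characters before the DOCTYPE come from can be seen in a small Python sketch that decodes BOM-prefixed UTF-8 bytes as ISO-8859-1. The sample markup is hypothetical and far shorter than the attached test files.

```python
import codecs

# Hypothetical sample: UTF-8 BOM followed by UTF-8 encoded markup.
data = codecs.BOM_UTF8 + "<!DOCTYPE html>æøå".encode("utf-8")

# Decoding with the erroneous external label turns each BOM byte
# into its own character and garbles the non-ASCII letters.
wrong = data.decode("iso-8859-1")

# Decoding as UTF-8 with BOM handling strips the signature.
right = data.decode("utf-8-sig")

prefix = wrong.split("<!DOCTYPE", 1)[0]
print(repr(prefix), len(prefix))
print(wrong[-6:], right[-3:])
```

With the wrong label, the three BOM bytes EF BB BF become the three characters ï, » and ¿ ahead of the DOCTYPE, and æøå turns into the same kind of gibberish that appears in Test 1.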


Leif Halvard Silli
