Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
So we would be in a case where it's impossible to guarantee full compatibility or interoperability between the two concurrent standards from the same standards body, while promising the best interoperability with past flavors of HTML (those past flavors are not really in the past, given that two of them are not deprecated but still fully recommended, and HTML5 still has draft status). HTML5 would contradict everyone else, and only HTML5.

But I still think that the discriminant factor for HTML5 is its exclusive (and **mandatory**) document declaration: if it is absent for any reason, there's absolutely NO reason to continue using an HTML5 parser, and browsers must then either:
- fall back to another legacy parser, or
- use an HTML5 parser (if that is the only one available) working in a more lenient mode, to recognize at least the XML prolog and the legacy SGML document declarations for HTML or XHTML, and at least recognize the encoding in the XML prolog when it is present.

This second option (more lenient parsing by the HTML5 parser) should be documented and become part of this future standard (still not finalized).

In my opinion the XML parser is definitely not a legacy parser; it is present in all browsers for lots of services and applications. It is even needed to support HTML5 in its XHTML serialization syntax (which is explicitly supported).

For me, it is normal that the Unicorn validator does not integrate HTML5, given its draft status. So there's still a separate validator (also working in a beta version, given the draft status of HTML5) which is not yet integrable into Unicorn.
But given the huge developments already made on the web with HTML5, it becomes urgent to fix these interoperability issues before the final release of HTML5: the existing major browsers are already modified constantly to follow the state of this draft, so it will not be difficult for them to implement the missing interoperability rules, and the sooner it is done, the sooner web designers will be guided. (And in that case, the beta nu validator of the W3C could start being integrated into Unicorn, which remains the best validator of them all; nu cannot be trusted for now, and it does not even return a conformance logo in its results, given that conformance rules are still not fully tested and specified in HTML5.) HTML5 remains an important project, but it is still not a standard by itself.

2012/11/28 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no

Philippe Verdy, Wed, 28 Nov 2012 11:02:45 +0100: In this case, Firefox and IE should not even be able to render *any* XHTML page because it violates the HTML5 standard.

(1) The page in question (http://www.xn--elqus623b.net/XKCD/1137.html) is (from a source code point of view) a pure XHTML page, and contains no HTML-compatible method for declaring the encoding. And therefore, that page does indeed violate the HTML5 standard, with the result that browsers are permitted to fall back to their built-in default encodings.

(2) According to XML, the XML prologue can be deleted for UTF-8 encoded pages. And when it is deleted/omitted, XML parsers assume that the page is UTF-8 encoded.

(3) And if you try that (that is: if you *do* delete the XML prologue from that page), then you will see that the Unicorn validator will *continue* to stamp that Web page as error free. This is because the Unicorn validator only considers the rules of XML - it doesn't consider the rules of HTML.

(4) Also, when you do delete the XML prologue, then not only Firefox and IE but even Safari will render the page in the wrong encoding. However, Opera and Chrome will continue to render the page as UTF-8 due to the UTF-8 sniffing that they cleverly have built in. Clearly, Opera and Chrome's behaviour is the way to go.

(5) It is indeed backwards that the W3C Unicorn validator doesn't inform its users when their pages fail to include an HTML-compatible method for declaring the encoding. This suboptimal validation could partly be related to libxml2, which Unicorn is partly based on. Because - as it turns out - the command line tool xmllint (which is part of libxml2) shows a very similar behaviour to that of Unicorn: it pays no respect to the fact that the MIME type (or Content-Type:) is 'text/html' and not an XML MIME type. In fact, when you do delete the XML prologue, Unicorn issues this warning (you must click to make it visible): No Character Encoding Found! Falling back to UTF-8. Which is a quite confusing message to send, given that HTML parsers do not, as their last resort, fall back to UTF-8.

-- leif halvard silli
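The lenient recovery mode proposed above (recognize at least the encoding pseudo-attribute of a leading XML prolog before falling back to a default) can be sketched in a few lines. This is a hypothetical illustration, not code from any browser or validator; the function name and the windows-1252 fallback are assumptions made for the sketch.

```python
import re

# Hypothetical pre-scan: recognize the encoding pseudo-attribute of an
# XML prolog, as proposed in the message above. The windows-1252 default
# stands in for a browser's locale-dependent legacy fallback.
XML_PROLOG = re.compile(rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def sniff_prolog_encoding(head: bytes, default: str = "windows-1252") -> str:
    """Return the encoding named in a leading XML prolog, else `default`."""
    m = XML_PROLOG.match(head)
    return m.group(1).decode("ascii") if m else default
```

For example, `sniff_prolog_encoding(b'<?xml version="1.0" encoding="utf-8"?>...')` yields `"utf-8"`, while a document starting with `<!DOCTYPE html>` falls through to the default.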
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 10:11:13 +0100: So we would be in a case where it's impossible to guarantee full compatibility or interoperability between the two concurrent standards from the same standards body, while promising the best interoperability with past flavors of HTML (those past flavors are not really in the past, given that two of them are not deprecated but still fully recommended, and HTML5 still has draft status).

Section 5.1 of XHTML 1.0 says: [1] 'XHTML Documents which follow the guidelines set forth in Appendix C, HTML Compatibility Guidelines may be labeled with the Internet Media Type text/html'

And Appendix C, point 9 of XHTML 1.0 says: [2] 'the best approach is to ensure that the web server provides the correct headers. If this is not possible, a document that wants to set its character encoding explicitly must include [ snip ] a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />).'

For me, it is normal that the Unicorn validator does not integrate HTML5, given its draft status.

The strange thing is that Unicorn doesn't integrate XHTML1. [1][2]

[1] http://www.w3.org/TR/xhtml1/#media
[2] http://www.w3.org/TR/xhtml1/#C_9

-- leif halvard silli
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
You're wrong. XHTML1 is integrated in the W3C validator and recognized automatically. The document you cite in the XHTML1 specs has just not been updated. http://validator.w3.org/check?uri=http%3A%2F%2Fwww.xn--elqus623b.net%2FXKCD%2F1137.html&charset=%28detect+automatically%29&doctype=Inline&group=0

Anyway this http://www.xn--elqus623b.net/XKCD/1137.html site is actually using XHTML 1.1 (in its strict schema, not a transitional schema).

2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 13:26:28 +0100: You're wrong. XHTML1 is integrated in the W3C validator and recognized automatically.

Indeed, yes. What I meant by 'doesn't integrate XHTML1' was that Unicorn doesn't 100% adhere to the two sections of XHTML1 that I quoted.[1][2]

The document you cite in the XHTML1 specs has just not been updated.

The validator must of course implement what XHTML1 says.

Anyway this http://www.xn--elqus623b.net/XKCD/1137.html site is actually using XHTML1.1 (in its strict schema, not a transitional schema)

A relevant point, of course. But XHTML11 says the same thing: [3] 'XHTML 1.1 documents SHOULD be labeled with the Internet Media Type application/xhtml+xml as defined in [RFC3236]. For further information on using media types with XHTML, see the informative note [XHTMLMIME].'

The XHTMLMIME note says: [4] 'The 'text/html' media type [RFC2854] is primarily for HTML, not for XHTML. In general, this media type is NOT suitable for XHTML except when the XHTML conforms to the guidelines in Appendix A.' [5] 'DO set the encoding via a meta http-equiv statement in the document (e.g., <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />)'

[1] http://www.w3.org/TR/xhtml1/#media
[2] http://www.w3.org/TR/xhtml1/#C_9
[3] http://www.w3.org/TR/xhtml11/xhtml11.html#strict
[4] http://www.w3.org/TR/xhtml-media-types/#text-html
[5] http://www.w3.org/TR/xhtml-media-types/#C_9

-- leif halvard silli
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
And you forget the important part of Appendix A: *Consequence*: Remember, however, that when the XML declaration is not included in a document, AND the character encoding is not specified by a higher level protocol such as HTTP, the document can only use the default character encodings UTF-8 or UTF-16. See, however, guideline 9 (http://www.w3.org/TR/xhtml-media-types/#C_9) below.

Here we have an XHTML site that is already encoded with the default UTF-8. There's no reason then for Firefox or IE to render it with windows-1252, even if they ignore the XML prolog. The text/html content-type remains appropriate for XHTML 1.0, 1.1 or 5.0. The other content-type is application/xhtml+xml and similar types for integrating other XML schemas, but it is only appropriate if you need another schema than just XHTML, or if you want to integrate support for an external or internal non-standard DTD, or support for XML processing instructions (including XML schemas not used here, or XML stylesheets, which is the case here for rendering the technical code when viewing the source, but not for rendering the described page content itself).

The problem here is guideline 9, which is not part of the standard, and which uses one of the worst parts of HTML, meta elements; this was partly ill-designed as an empty element, and it binds the content-type in a way that overrides it and forces reparsing from the start, after parsing all or part of the other required elements (html, head, title, body). But why? Isn't UTF-8 (or alternatively UTF-16) already the default encoding of XHTML? If not, then we should file a bug against the W3C Validator for not honoring guideline 9 (even though it is not part of the standard itself, but just a recommendation, it should issue at least a warning).

2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 14:24:29 +0100: And you forget the important part of Appendix A: *Consequence*: Remember, however, that when the XML declaration is not included in a document, AND the character encoding is not specified by a higher level protocol such as HTTP, the document can only use the default character encodings UTF-8 or UTF-16. See, however, guideline 9 (http://www.w3.org/TR/xhtml-media-types/#C_9) below. Here we have an XHTML site that is already encoded with the default UTF-8. There's no reason then for Firefox or IE to render it with windows-1252, even if they ignore the XML prolog. The text/html content-type remains appropriate for XHTML 1.0, 1.1 or 5.0.

Note that point 1, which you quoted,[1] and all the rest of the entire note, is about how *authors* should behave when they create XHTML documents. The note is *not* about how user agents should behave. Also note that what you refer to as the important part of Appendix A ends in a sentence that points to guideline 9, which in turn tells authors to 'DO set the encoding via a meta http-equiv', and note that the example in guideline 9 uses UTF-8 as example, quote: '(e.g., <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />)'.

... But why? Isn't UTF-8 (or alternatively UTF-16) already the default encoding of XHTML? If not, then we should file a bug against the W3C Validator for not honoring guideline 9 (even though it is not part of the standard itself, but just a recommendation, it should issue at least a warning).

This is exactly the problem. Your 'if not' does apply! Because, if one presents an XHTML document to the browser as HTML, then windows-1252 - and not UTF-8 - becomes the default encoding. And, in fact, as a consequence of our dialog, I have notified the developers of Unicorn about the shortcoming, asking them to issue a warning.

[1] http://www.w3.org/TR/xhtml-media-types/#C_1
[2] http://www.w3.org/TR/xhtml-media-types/#C_9

-- leif halvard silli
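The crux of this exchange, that the fallback encoding depends on how the document is presented rather than on what it contains, can be made concrete with a small sketch. This is illustrative only: real browsers use a locale-dependent legacy default rather than a hard-coded windows-1252, and the function below is a hypothetical helper, not part of any browser or validator.

```python
def default_encoding(content_type: str) -> str:
    """Fallback encoding when nothing declares one, per the discussion
    above: XML media types default to UTF-8 (per the XML spec), while
    text/html falls back to a legacy default. windows-1252 stands in for
    that legacy default (an assumption; the real value varies by locale)."""
    mime = content_type.split(";")[0].strip().lower()
    if mime.endswith("+xml") or mime in ("text/xml", "application/xml"):
        return "utf-8"
    return "windows-1252"
```

So the same XHTML bytes get UTF-8 when served as `application/xhtml+xml` but windows-1252 when served as `text/html`, which is exactly the trap the page in question fell into.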
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no

Philippe Verdy, Thu, 29 Nov 2012 14:24:29 +0100: ... But why? Isn't UTF-8 (or alternatively UTF-16) already the default encoding of XHTML? If not, then we should file a bug against the W3C Validator for not honoring guideline 9 (even though it is not part of the standard itself, but just a recommendation, it should issue at least a warning).

This is exactly the problem. Your 'if not' does apply! Because, if one presents an XHTML document to the browser as HTML, then windows-1252 - and not UTF-8 - becomes the default encoding. And, in fact, as a consequence of our dialog, I have notified the developers of Unicorn about the shortcoming, asking them to issue a warning.

Thanks a lot, this was really hard to see and understand, because I was only reading the XHTML specs, and the Validator did not complain.

As a side note, the Unicorn validator, which senses the content-type (in its simple interface), will still sense an XHTML content which remains valid by itself. The issue arises only when it is presented as HTML, and the validator should allow seeing the effect of using HTML parsers (HTML4 or HTML5) on XHTML documents, by offering a way to select another document type than the autodetected one (XHTML here) whenever the warning is displayed. Because the XHTML document may not validate at all when parsed as HTML: it will first get warnings about the presence of the XML prolog (which is generally not a problem, as it is typically ignored in browsers), but an error about XML processing instructions (I don't think that the optional leading XML declaration is a processing instruction), or an error about a non-conforming document declaration (according to the selected HTML flavor: HTML4 or HTML5).

Anyway, we can expect this page design error to be frequent, and HTML5 parsers had still better not discard the XML declaration, but at least recognize its encoding pseudo-attribute (even if processing continues using HTML rules and not XML rules), instead of relying on the presence of the meta element, which is really ugly and forces reparsing using the detected encoding instead of the default windows-1252 (this is unnecessarily slow). That would make guideline 9 applicable only to flavors of HTML before HTML5, once it is released; in that case the warning issued by the Validator would apply only to versions before HTML5, but not to HTML5. This would increase the compatibility of HTML5 parsers with valid XHTML1 and XHTML5 documents simply created or modified by XML or XHTML editors.
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
In my opinion, for HTML5 (and not XHTML5) there should also exist a leading prolog like:

<?html version="5.0" encoding="utf-8"?>

For XHTML5, we would continue using the XML prolog; but it *may* be followed by the html prolog, without needing to repeat the optional encoding pseudo-attribute, which XML parsers would treat as a processing instruction:

<?xml version="1.0" encoding="utf-8"?>
<?html version="5.0"?>

In the absence of these prologs, each parser would use its default encoding. Autosensing of document types would remain possible, and HTML5 would also no longer be dependent on transport protocols or on the very ugly <meta http-equiv="Content-type" content="text/html;charset=utf-8"> element which forces reparsing. The pseudo-DOCTYPE tentatively introduced in HTML5, which breaks in SGML parsers and in past HTML parsers, should be eliminated from HTML5 if the HTML prolog is present (the HTML prolog would be highly preferred, including for its useful versioning).
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 16:10:14 +0100: Thanks a lot, this was really hard to see and understand, because I was only reading the XHTML specs, and the Validator did not complain.

Glad to find we are on the same page!

Philippe Verdy, Thu, 29 Nov 2012 16:27:13 +0100: <?html version="5.0" encoding="utf-8"?>

HTML5 already has 4 *conforming* methods for setting the UTF-8 encoding:

1. byte-order mark
2. HTTP server, Content-Type: text/html;charset=UTF-8
3. meta http-equiv, <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
4. meta charset, <meta charset="UTF-8"/> (Note that there is no content-type here, and thus the meta charset method is cleaner to use in a file served as XHTML.)

In addition, other things have effect:

6. Sniffing is an official, but largely unimplemented method for getting the encoding (Chrome and Opera use it, and Firefox has it as an option and also uses it by default for some locales.)
7. The XML prologue (sic) takes effect in *some* browsers.
8. Simply serving the page as application/xhtml+xml is yet another method of setting the encoding to UTF-8.

Thus I can guarantee you that your idea about a method number 9 is not going to be met with enthusiasm.

-- leif halvard silli
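The four conforming mechanisms in the list above can be checked mechanically. The following sketch (a hypothetical helper, not part of any validator; the function name and regexes are assumptions for illustration) reports which of them declare UTF-8 for a given document head and HTTP Content-Type header:

```python
import re

def declared_utf8_methods(head_bytes: bytes, http_content_type: str = "") -> list:
    """Report which of the four conforming mechanisms listed above
    declare UTF-8. Simplified for illustration: real parsers tokenize
    rather than use regexes, and bound how far they scan."""
    found = []
    if head_bytes.startswith(b"\xef\xbb\xbf"):              # 1. byte-order mark
        found.append("BOM")
    if "charset=utf-8" in http_content_type.lower().replace(" ", ""):
        found.append("HTTP Content-Type")                   # 2. HTTP header
    head = head_bytes.decode("ascii", errors="replace").lower()
    if re.search(r'<meta\s+http-equiv=["\']?content-type["\']?[^>]*charset=utf-8', head):
        found.append("meta http-equiv")                     # 3. meta http-equiv
    if re.search(r'<meta\s+charset=["\']?utf-8', head):
        found.append("meta charset")                        # 4. meta charset
    return found
```

A page like the XKCD mirror under discussion, with only an XML prolog, returns an empty list from this check, which is precisely why it falls back to the browser default when served as text/html.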
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
- Method 1 (the BOM) is only good for UTF-16; it is not reliable for UTF-8, which is still the default for XHTML (and where the BOM is not always present).
- Method 2 works sometimes, but is not practical for many servers that you can't configure to change their content-type for specific pages all having the same *.html extension, or that are relayed by some proxies; it also depends on the transport layer (HTTP here) being capable of offering it (HTML files in file systems do not provide the info). But if it is implemented it will take precedence, possibly indicating that the document was reencoded (by a proxy for example).
- Methods 3 and 4 are completely equivalent and share the same problem: they require restarting the parsing. They are equally ugly (just like all empty meta elements in the HTML header or in the body). Introducing another attribute on the meta element (which already has name, http-equiv, and now charset) is also a bad idea (data encoded in attributes that are part of the document root breaks the concept of what is metadata); it also forbids reencoding the document during processing if the document is digitally signed for its content independently of its encoding: to check the document signature, you would not only have to parse it completely up to the DOM level, but also ignore these specific meta elements (but not all meta elements, such as links).
- Method 5 is where?
- Method 6 (sniffing) is a transitory solution (as long as HTML5 is not released), or a last-chance palliative solution based only on a heuristic, which sometimes fails. Not reliable.
- Method 7 (using the XML prolog) is excellent for XML. It will reliably work with XHTML5, without needing reparsing.
- Method 8 (content-type set to application/xhtml+xml in the transport layer) is exactly like method 2 (and suffers the same problem), but this content-type is not really intended for HTML5, not even XHTML5, as it implies an application and an extensible schema that XHTML5 will not parse. Method 8 for me implies the forced use of an XML parser, not an HTML parser. All XML extensions (including namespaces) will be valid.

My method is a generalisation to HTML of the excellent method 7 for XHTML (based on its standard and the XML standard). It requires absolutely no reparsing, and supports explicit versioning of HTML (for future evolutions of its supported schema), without overriding the independent versioning of XML if it is used. As well, it does not require the new ugly DOCTYPE, which indicates absolutely nothing significant, will not allow versioning, and breaks SGML parsers as well as XML parsers. It takes benefit of the fact that prologs don't break browsers in method 7 (even if some of them do not sniff even the encoding from the XML prolog).

2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no
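The reparsing cost objected to above for methods 3 and 4 can be illustrated with a deliberately naive two-pass sketch. This is hypothetical: real HTML5 parsers run a bounded prescan over the first bytes rather than decoding the whole document twice, and the names below are invented for the example.

```python
import re

# Naive model of the restart discussed above: decode with the default
# encoding, and if a <meta> names a different charset, throw that work
# away and re-decode from the beginning. Not a browser implementation.
META_CHARSET = re.compile(r'<meta\s[^>]*charset\s*=\s*["\']?([a-zA-Z0-9_-]+)', re.I)

def parse_with_restart(raw: bytes, default: str = "windows-1252"):
    """Return (decoded_text, restarted). `restarted` is True when a
    late meta charset forced re-decoding the entire byte stream."""
    text = raw.decode(default, errors="replace")          # first pass
    m = META_CHARSET.search(text)
    if m and m.group(1).lower() != default:
        return raw.decode(m.group(1), errors="replace"), True   # second pass
    return text, False
```

The wasted first pass is the inefficiency being criticized; a declaration in the very first bytes (HTTP header, BOM, or a prolog) avoids it entirely.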
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 19:11:42 +0100: 2012/11/29 Leif Halvard Silli: Philippe Verdy, Thu, 29 Nov 2012 16:27:13 +0100: <?html version="5.0" encoding="utf-8"?>

Thus I can guarantee you that your idea about a method number 9 is not going to be met with enthusiasm.

- Method 5 is where?

Sorry. So your method is just method number 8, then.

My method is a generalisation to HTML of the excellent method 7

I have given you my verdict. This topic is over for my part. Thanks for the exchange!

-- leif halvard silli
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Note that I challenge the term 'conforming' that you use, given that HTML5 is still not released, so its conformance is still not formally defined. The nu validator is still explicitly marked by the W3C as experimental.

2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no HTML5 already has 4 *conforming* methods for setting the UTF-8 encoding:
Re: xkcd: LTR
2012/11/28 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no

Philippe Verdy, Wed, 28 Nov 2012 04:50:06 +0100: detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, or for HTML4 or before, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception to try another parser.

There is no spec, that I am aware of, that says that it should do that.

But this is in the scope of HTML5, whose claimed purpose is to become compatible with documents encoded in all previous flavors of HTML.

I admit that understanding the meaning behind all the slogans about HTML5 can be demanding. But the goal has all the time been to create a *single* HTML parser, and not to introduce switching between multiple HTML parsers. If you think otherwise, then my claim is that you have misunderstood.

In this case, Firefox and IE should not even be able to render *any* XHTML page because it violates the HTML5 standard.

It still attempts to recover from it, recognizing a part of XHTML, but not an essential one: its very basic XML prolog (and the XHTML document declaration), up to the point where they start seeing the root element. But then how can they claim to support XHTML when they don't (and when the XHTML syntax is still part of HTML5, which makes affirmations like yours - not to introduce switching between multiple HTML parsers - very weak and difficult to defend)?

If the intent is to be able to parse all flavors of HTML (at least a basic profile of them) with the same parser, then a behavior must be standardized in HTML5 to correctly handle the possible presence of XML prologs and standard SGML document declarations, even if their contents are skipped and ignored (notably here when they are used to specify that the document is effectively encoded in UTF-8 and not cp1252, both encodings being supported by HTML5, but without other compatibility problems when it is UTF-8).

But ignoring XHTML document declarations will have an impact on compatibility if there's an external or internal DTD, and this should be documented in HTML5 by limiting the claims of compatibility (and suggesting then another recovery mechanism for these unsupported parts of XHTML, using a true XML parser in case of violation of the required HTML5 DOCTYPE declaration). For now this breaks interoperability with the basic profile of XHTML, more or less compatible with HTML4 including the deprecated elements (but without the modular extension design, and without support of XML namespaces).

Now the argument saying that meta elements may be used in the HTML document header (to replace missing HTTP MIME headers) contradicts all that was done to deprecate this meta element usage before. And here also, HTML5 is not clear about this change of position. And meta elements won't make HTML parsers simpler to implement: they will need to reparse the document from the beginning.

The XML prolog of XHTML is much simpler to parse than the meta element, and can be parsed directly by the HTML5-only parser, which could as well accept at least the XHTML1 document declaration (without internal DTD). It should fail, however, if there's an internal DTD, or if the SGML catalog name is not one of those for HTML or XHTML; it should just check the SGML catalog name, partly ignoring the flavor precision in its name, as there is no internal or external DTD supported in HTML4 or lower; and it should silently ignore the URL of an external DTD in XHTML, including when XHTML is used as the alternate serialization syntax for HTML5, even if this causes some entities defined in the external DTD not to be replaced (the result of the HTML5 parser with undefined entities, or entities defined differently, will be unpredictable).

If an implementation can support both parsers, the more compatible recovery mode will be to use the XML parser instead of this simple heuristic. Browsers already support multiple text-encoded document parsers alongside HTML5 (JavaScript, JSON, CSS, SVG/XML, P3P, URI...), plus binary parsers for various media codecs (PNG, GIF, JPEG, WAV, ICO, OpenPDF...) if they embed them instead of using OS-supported codecs or plugins (MPEG, Ogg...), and data codecs (compressors, encryptors, archive formats... for transport and security protocol layers referenced in URI schemes). What else? In all popular browsers, the XML parser is still present, and has been for a long time, to support XML requests (and lots of GUI or configuration features, such as XUL in Firefox, VML in IE, external SVG images, local DB stores, support libraries for third-party addons...), even if JSON requests are highly preferred now: sometimes more secure, but much simpler and faster to parse (and more compact in their serialization).
UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Wed, 28 Nov 2012 11:02:45 +0100: In this case, Firefox and IE should not even be able to render *any* XHTML page because it violates the HTML5 standard.

(1) The page in question (http://www.xn--elqus623b.net/XKCD/1137.html) is (from a source code point of view) a pure XHTML page, and contains no HTML-compatible method for declaring the encoding. And therefore, that page does indeed violate the HTML5 standard, with the result that browsers are permitted to fall back to their built-in default encodings.

(2) According to XML, the XML prologue can be deleted for UTF-8 encoded pages. And when it is deleted/omitted, XML parsers assume that the page is UTF-8 encoded.

(3) And if you try that (that is: if you *do* delete the XML prologue from that page), then you will see that the Unicorn validator will *continue* to stamp that Web page as error free. This is because the Unicorn validator only considers the rules of XML - it doesn't consider the rules of HTML.

(4) Also, when you do delete the XML prologue, then not only Firefox and IE but even Safari will render the page in the wrong encoding. However, Opera and Chrome will continue to render the page as UTF-8 due to the UTF-8 sniffing that they cleverly have built in. Clearly, Opera and Chrome's behaviour is the way to go.

(5) It is indeed backwards that the W3C Unicorn validator doesn't inform its users when their pages fail to include an HTML-compatible method for declaring the encoding. This suboptimal validation could partly be related to libxml2, which Unicorn is partly based on. Because - as it turns out - the command line tool xmllint (which is part of libxml2) shows a very similar behaviour to that of Unicorn: it pays no respect to the fact that the MIME type (or Content-Type:) is 'text/html' and not an XML MIME type. In fact, when you do delete the XML prologue, Unicorn issues this warning (you must click to make it visible): No Character Encoding Found! Falling back to UTF-8. Which is a quite confusing message to send, given that HTML parsers do not, as their last resort, fall back to UTF-8.

-- leif halvard silli
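The UTF-8 sniffing credited above to Opera and Chrome exploits a statistical fact: byte streams in legacy single-byte encodings almost never happen to form valid multi-byte UTF-8 sequences. A naive sketch of such a check follows (illustrative only; real browser heuristics are more elaborate and scan a bounded prefix, and the function name is invented here):

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True when the bytes decode cleanly as UTF-8 *and* contain
    at least one non-ASCII sequence (pure ASCII is inconclusive, since
    it is valid in nearly every encoding). Naive illustration of
    encoding sniffing; not browser code."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in raw)
```

This is why sniffing rescued the prolog-less page in Opera and Chrome: its non-ASCII bytes validated as UTF-8, which a windows-1252 document would almost never do by accident.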
Re: xkcd: LTR
On 11/26/2012 08:42 PM, Marc Durdin wrote: Somewhat ironically, both Firefox and Internet Explorer, on my machine at least, detect this page is encoded with ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog which is the only place where the encoding is stated. Firefox follows the HTML5 spec and ignores the XML prolog, since the Content-type is text/html.
Re: xkcd: LTR
Simon, There's no sign of HTML5 on that page. The head of the file matches all XHTML 1.1 requirements and passes all checks on validator.w3.org. Now, why would Firefox follow anything from the HTML5 spec here? -Behnam On Tue, Nov 27, 2012 at 3:37 AM, Simon Montagu smont...@smontagu.org wrote: On 11/26/2012 08:42 PM, Marc Durdin wrote: Somewhat ironically, both Firefox and Internet Explorer, on my machine at least, detect this page as encoded with ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog, which is the only place where the encoding is stated. Firefox follows the HTML5 spec and ignores the XML prolog, since the Content-Type is text/html. -- Behnam Esfahbod | بهنام اسفهبد http://behnam.es/ http://zwnj.behnam.es/ GPG Fingerprint: 3E7F B4B6 6F4C A8AB 9BB9 7520 5701 CA40 259E 0F8B
Re: xkcd: LTR
On 11/27/2012 11:19 AM, Behnam Esfahbod ZWNJ wrote: Simon, There's no sign of HTML5 on that page. The head of the file matches all XHTML 1.1 requirements and passes all checks on validator.w3.org. Now, why would Firefox follow anything from the HTML5 spec here? As I already said, because of the Content-Type HTTP header.
Re: xkcd: LTR
HTML5 does not reference the Content-Type: text/html header as enough to qualify as meaning HTML5. HTML5 **requires** its own prolog (i.e. its basic document declaration **within** the document itself, for the HTML syntax, or its FULL document declaration for the XML/XHTML syntax). So Firefox is wrong when it attempts to use HTML5 to render all HTML dialects. 2012/11/27 Simon Montagu smont...@smontagu.org On 11/27/2012 11:19 AM, Behnam Esfahbod ZWNJ wrote: Simon, There's no sign of HTML5 on that page. The head of the file matches all XHTML 1.1 requirements and passes all checks on validator.w3.org. Now, why would Firefox follow anything from the HTML5 spec here? As I already said, because of the Content-Type HTTP header.
Re: xkcd: LTR
I've never said that user agents had to 'write' the prolog. It's the reverse: yes, authors have to write a prolog (but the prolog is perfect here, so this is not the fault of the author). Why do authors have to write this prolog? Exactly because user agents will have to read it (not write it), as it is expected for validating that this is effectively HTML5 content (the Content-Type: text/html is clearly not enough; it is exactly the same as HTML4 or all past versions of HTML, working in quirks mode or not). By your assertion, all HTML5 browsers would then need to parse HTML4 as if it were HTML5, using its strict definitions that are not compatible with HTML4 (even if we ignore the quirks mode), or all past versions. HTML5 parsing is triggered by the presence of the required HTML5 prolog.
Re: xkcd: LTR
Also, you conflate this with the requirement that HTML5 parsers must be able to parse HTML4. This is true, but it does not mean that they will be able to render it fully. HTML5 is not fully upward compatible with past versions (the identification of encodings is one example where it differs, and many requirements of HTML4 are no longer requirements in HTML5, due to some relaxed rules after the failed effort to standardize HTML4 more like XHTML and according to the initial CSS specifications). So HTML5 renderers will just render HTML4 in a best effort, but lots of requirements that are applicable to real *HTML5* documents (identified by their prolog) do NOT apply to non-HTML5 documents, as they are not directly in scope of its standard (the HTML4 specifications themselves are not dismissed): the best effort implies flexibility, even if interoperability is not guaranteed across HTML5 implementations, which will all parse HTML4 documents but may still produce different results (including with the support of HTML4 quirks mode if they want). 2012/11/27 Masatoshi Kimura vyv03...@nifty.ne.jp (2012/11/27 20:27), Philippe Verdy wrote: HTML5 does not reference the Content-Type: text/html header as enough to qualify as meaning HTML5. HTML5 user agents must parse any byte sequence as an HTML5 document if the Content-Type is text/html. HTML5 **requires** its own prolog (i.e. its basic document declaration **within** the document itself, for the HTML syntax, or its FULL document declaration for the XML/XHTML syntax). HTML5 requires **authors** to write the prolog, not user agents. A lacking prolog just turns the user agent to quirks mode. Note that quirks mode doesn't mean do whatever you consider quirky. Parsing a quirks-mode document is also completely spec'ed. So Firefox is wrong and attempts to use HTML5 to render all HTML dialects. No, not at all. Rather, it is required by the spec to use the HTML5 parser to parse all byte sequences sent with Content-Type: text/html. 
Could you please stop spreading an unfounded rumor such as Firefox is wrong because it ignores the lacking of HTML5 prolog? -- vyv03...@nifty.ne.jp
Re: xkcd: LTR
Philippe Verdy, Tue, 27 Nov 2012 15:39:43 +0100: I've never said that user agents had to 'write' the prolog. It's the reverse: yes authors have to write a prolog (but the prolog is perfect here so this is not the fault of the author). XML has (or more correctly: can have) a prolog. HTML does not have a prolog. Now to the million dollar question: is your page in question XML or HTML? Answer: Per the Content-Type, it is HTML (that is: text/html). Next question: Does the XML prolog have any effect when the XML file (more specifically: the XHTML file) is served as HTML (that is: text/html)? The answer is that, per HTML5, it does not have effect. And of course, per HTML4, it does not have effect. As for XHTML 1, it cannot really regulate what is supposed to happen in text/html. The problem/challenge, however, is that some Web browsers - such as W3m (a text browser), Chrome, Opera and Safari - *do* look at the prolog for encoding info *also* when served as HTML. But Firefox and Internet Explorer do not. Which is according to the HTML5 specification. My guess is that it will *never* become conforming to use the XML prologue in HTML files. However, that does not necessarily prevent Firefox from looking at the prologue for encoding info, when *that* is the only source of encoding info. In fact, I think the HTML5 encoding sniffing algorithm already permits this (since it has a step which roughly says: if the user agent has other sources of information). So, for what it is worth - and with reference to your pages, I filed a bug against Firefox, to make it start to use the encoding declaration of the XML prologue when nothing else is available: https://bugzilla.mozilla.org/show_bug.cgi?id=815279 -- leif halvard silli
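[The encoding sniffing algorithm Leif refers to includes a "prescan" of the first bytes of the document for a meta declaration. A deliberately simplified sketch of that idea; the real algorithm is a byte-level state machine, not a regex, so this only illustrates the concept:]

```python
import re

def prescan_meta_charset(head):
    """Look for a meta charset declaration in the first 1024 bytes.
    Grossly simplified compared to the real HTML5 prescan."""
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)',
                  head[:1024], re.IGNORECASE)
    return m.group(1).decode("ascii").lower() if m else None

print(prescan_meta_charset(b'<meta charset="UTF-8">'))
# utf-8
print(prescan_meta_charset(
    b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'))
# utf-8
```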
Re: xkcd: LTR
Looks OK here, but that is probably FreeType doing its magic as usual. Regards, Khaled On Tue, Nov 27, 2012 at 02:29:45AM +0100, Philippe Verdy wrote: Also I really don't like the Deseret font: {font-family: CMU; src: url(CMUSerif-Roman.ttf) format(truetype);} that you have inserted in your stylesheet (da.css), which is used to display the whole text content of the page, including the English Latin text at the bottom. This downloaded font is difficult to read because it is not hinted at all (so its rendering on screen is extremely poor; we probably don't want to print each page of this XKCD series, when the main interest is the image, which is perfectly readable). Could you ask someone on this list to help you hint this font at least minimally (even basic autohinting would be much better)? 2012/11/27 Philippe Verdy verd...@wanadoo.fr Did you try adding the xml:lang=en-Dsrt pseudo-attribute to the html element, as suggested by the W3C Unicorn validator? http://validator.w3.org/unicorn/check?ucn_uri=www.xn--elqus623b.net%2FXKCD%2F1138.htmlucn_lang=frucn_task=conformance# Maybe this could help IE and Firefox, which can't figure out the language used, to properly detect the encoding if they still don't trust the XML declaration in this case, and avoid their use of an encoding guesser. It is anyway curious, because this site is valid as XHTML 1.1 (not as HTML5, which uses a very different and simplified prolog that is not matched here, so the legacy rules should apply to detect XHTML here, then legacy HTML4 if XHTML is no longer recognized by IE and Firefox). Because XHTML is properly tagged, the XML requirements should apply and the XML declaration in the prolog should be used, without needing to guess the encoding from the rest of the content (starting with a meta element in the HTML head element). 2012/11/27 John H. Jenkins jenk...@apple.com That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) 
On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and it displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Ah! Now I see the problem: the XHTML file is being served as HTML instead of XHTML (but this is not invalid for XHTML 1). But anyway, you're also right that the XML prolog found is NOT valid for HTML5 when the file is served as HTML instead of XHTML. This should immediately trigger the fact that HTML5 should not be used to render the page in the HTML profile. So these browsers must find something else: given the XML prolog, they should then use HTML5 in its XHTML profile, not in its HTML profile; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Now given the XML prolog and the DTD declaration, the file is clearly not even HTML5 in XML/XHTML (i.e. XHTML 5), but is XHTML 1 (based on a stable subset of HTML4, but working in strict mode without the quirks modes). Once again, this excludes using the HTML5 rules. I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard, and where processing the XML prolog is NOT an option but a requirement). 2012/11/27 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no Philippe Verdy, Tue, 27 Nov 2012 15:39:43 +0100: I've never said that user agents had to 'write' the prolog. It's the reverse: yes authors have to write a prolog (but the prolog is perfect here so this is not the fault of the author). XML has (or more correctly: can have) a prolog. HTML does not have a prolog. Now to the million dollar question: is your page in question XML or HTML? Answer: Per the Content-Type, it is HTML (that is: text/html). Next question: Does the XML prolog have any effect when the XML file (more specifically: the XHTML file) is served as HTML (that is: text/html)? The answer is that, per HTML5, it does not have effect. And of course, per HTML4, it does not have effect. 
As for XHTML 1, it cannot really regulate what is supposed to happen in text/html. The problem/challenge, however, is that some Web browsers - such as W3m (a text browser), Chrome, Opera and Safari - *do* look at the prolog for encoding info *also* when served as HTML. But Firefox and Internet Explorer do not. Which is according to the HTML5 specification. My guess is that it will *never* become conforming to use the XML prologue in HTML files. However, that does not necessarily prevent Firefox from looking at the prologue for encoding info, when *that* is the only source of encoding info. In fact, I think the HTML5 encoding sniffing algorithm already permits this (since it has a step which roughly says: if the user agent has other sources of information). So, for what it is worth - and with reference to your pages, I filed a bug against Firefox, to make it start to use the encoding declaration of the XML prologue when nothing else is available: https://bugzilla.mozilla.org/show_bug.cgi?id=815279 -- leif halvard silli
Re: xkcd: LTR
No. FreeType is not involved here in the ugly on-screen rendering, under Windows, of the unhinted CMU font provided by the page. Maybe this looks OK on Mac (if Safari is autohinting the font itself, despite the font not being hinted; I'm not sure that Safari on MacOS processes TTF fonts this way when they are not hinted, and I'm convinced that unhinted fonts should not be magically autohinted by the renderer). So using the xml:lang=en-Dsrt pseudo-attribute remains a good suggestion: it allows a CSS stylesheet to avoid referencing the CMU font on Windows and MacOS when displaying the Latin text (using xml:lang=en), and allows the same stylesheet to specify a much better Deseret font for Windows (Segoe UI is fine on Windows). There will still remain a problem for rendering the page on Linux (where FreeType is used, which does not itself autohint the unhinted font, and where Segoe UI is not available) and on Windows before Windows 7 (no Segoe UI font either; you'll also need a hinted version of the CMU font). 2012/11/27 Khaled Hosny khaledho...@eglug.org Looks OK here, but that is probably FreeType doing its magic as usual.
Re: xkcd: LTR
Philippe Verdy, Tue, 27 Nov 2012 21:07:31 +0100: Ah! Now I see the problem: the XHTML file is being served as HTML instead of XHTML (but this is not invalid for XHTML 1). Both SGML-based HTML4 and XML-based XHTML 1 operate with syntax rules that are not - and have never been - compatible with the way text/html operates. Thus, both HTML4 and XHTML1 permit syntaxes whose semantics are ignored when the document is parsed as HTML (as opposed to parsed as SGML or as XML). If you are interested in creating XHTML syntax that is compatible with HTML, then you should look at Polyglot Markup: http://www.w3.org/TR/html-polyglot/ But anyway you're also right that the XML prolog found is NOT valid for HTML5 when the file is served as HTML instead of XHTML. The fact that XHTML 1 permits the XML prolog regardless of how the document is served is just a shortcoming of the XHTML 1 specification. So these browsers must find something else: given the XML prolog they should then use HTML5 in its XHTML profile, not in its HTML profile No, that is not how things work. The decision to parse the document as HTML is taken before the browser sees the XML prologue. So the prologue should not - and does not - change anything with regard to parsing as HTML or as XML. ; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Again: Absolutely not. They must not, will not and must not honour the XML prologue. (It is another matter that some user agents sometimes use the prologue to look for encoding information.) Now given the XML prolog and the DTD declaration, the file is clearly not even HTML5 in XML/XHTML (i.e. XHTML 5), but is XHTML 1 (based on a stable subset of HTML4, but working in strict mode without the quirks modes). Once again, this excludes using the HTML5 rules again. In a way, the names and the numbers (HTML4, XHTML1, HTML5) are just confusing. 
There is just one way to parse HTML. When it comes to HTML (text/html), HTML5 differs from HTML4 and XHTML1 in that it is not based on *another* format than HTML itself. Because HTML4 and XHTML1 are not based on how HTML actually works, and - in addition - do not take full account of that (or whatever the reason), they allow syntaxes, such as DTD declarations, which have no effect (except side-effects such as quirks mode) in HTML. I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard and where processing the XML prolog is NOT an option but a requirement). Just for the record: HTML5 defines the most up-to-date parsing mechanism for *all* HTML documents - HTML 1, 2, 3, 5 as well as any flavour of XHTML served as HTML. HTML5 does not allow authors to use the XML prologue. So while XHTML1 allows you to use the prologue, the best description of how to parse anything that purports to be HTML - HTML5 - does not require user agents/browsers to pay any attention to the prologue. Thus the correct one to blame in this case, for the fact that it doesn't work in Firefox, seems to be the author. (Though we could also blame the history of how HTML developed.) -- leif halvard silli
Re: xkcd: LTR
On 11/27/2012 5:39 AM, Masatoshi Kimura wrote: (2012/11/27 20:27), Philippe Verdy wrote: Could you please stop spreading an unfounded rumor such as Firefox is wrong because it ignores the lacking of HTML5 prolog? Getting Philippe to stop spreading unfounded anything is a near impossible task. :) A./
Re: xkcd: LTR
2012/11/27 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no The fact that XHTML 1 permits the XML prolog regardless of how the document is served is just a shortcoming of the XHTML 1 specification. No, it was by design: making HTML an application of XML. Only XML, but with all the rules of XML. So these browsers must find something else: given the XML prolog they should then use HTML5 in its XHTML profile, not in its HTML profile No, that is not how things work. The decision to parse the document as HTML is taken before the browser sees the XML prologue. So the prologue should not - and does not - change anything with regard to parsing as HTML or as XML. Then explain why the W3C validator sees absolutely no problem in the way these XHTML1 pages are encoded and transported. ; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Again: Absolutely not. They must not, will not and must not honour the XML prologue. (It is another matter that some user agents sometimes use the prologue to look for encoding information.) Sure they can, because this XHTML1 site violates HTML5 rules, missing its required prologue. Now given the XML prolog and the DTD declaration, the file is clearly not even HTML5 in XML/XHTML (i.e. XHTML 5), but is XHTML 1 (based on a stable subset of HTML4, but working in strict mode without the quirks modes). Once again, this excludes using the HTML5 rules again. In a way the names and the numbers (HTML4, XHTML1, HTML5) are just confusing. There is just one way to parse HTML. When it comes to HTML (text/html), HTML5 differs from HTML4 and XHTML1 in that it is not based on *another* format than HTML itself. 
Because HTML4 and XHTML1 are not based on how HTML actually works, and - in addition - do not take full account of that (or whatever the reason), they allow syntaxes, such as DTD declarations, which have no effect (except side-effects such as quirks mode) in HTML. HTML5 admits the two syntaxes: SGML-based, as it is used primarily (in a simplified profile), and XML. I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard and where processing the XML prolog is NOT an option but a requirement). Just for the record: HTML5 defines the most up-to-date parsing mechanism for *all* HTML documents - HTML 1, 2, 3, 5 as well as any flavour of XHTML served as HTML. HTML5 does not allow authors to use the XML prologue. Where? The required HTML5 prolog applies to its SGML-based syntax; it makes no sense in XHTML, as it voluntarily violates the validity of the XML document declaration. The absence of the required HTML5 prolog (in its standard basic-SGML profile), or the presence of another, incompatible XML prolog, is enough to make the distinction between the two syntaxes. But both syntaxes will generate the same HTML DOM, which is just enough to produce the intended rendering and make HTML5 compatible with both syntaxes. Now HTML5 is still not completely polished, finished and approved. Such interoperability rules are not clearly defined, even if they are the most up-to-date way to make it work seamlessly with the claimed compatibility with all flavors of HTML or XHTML. And the fact that Firefox and IE behave differently from Chrome and Safari in this domain is proof of this unfinished status.
Re: xkcd: LTR
Philippe Verdy, Wed, 28 Nov 2012 01:10:45 +0100: 2012/11/27 Leif Halvard Silli The fact that XHTML 1 permits the XML prolog regardless of how the document is served is just a shortcoming of the XHTML 1 specification. No, it was by design. Making HTML an application of XML. Only XML, but with all the rules of XML. It was by design. But nevertheless a shortcoming. They should/could have defined more restrictions on the syntax than they did, and then it would have been OK. But don't forget that XHTML1 also permits you to use the meta element - which works in all web browsers - for setting the encoding: meta http-equiv=Content-Type content=text/html; charset=UTF-8 / This is described in the famous Appendix C of XHTML 1: http://www.w3.org/TR/xhtml1/#C_9 So these browsers must find something else: given the XML prolog they should then use HTML5 in its XHTML profile, not in its HTML profile No, that is not how things work. The decision to parse the document as HTML is taken before the browser sees the XML prologue. So the prologue should not - and does not - change anything with regard to parsing as HTML or as XML. Then explain why the W3C validator sees absolutely no problem in the way these XHTML1 pages are encoded and transported. Because it only checks the syntax, without asking you how you are actually going to use that syntax - whether you want to serve it to an XML parser as XHTML, or you are going to serve it to an HTML parser. For a new version of the validator that asks more of those questions, please try http://validator.w3.org/nu/ - it happens, for the most part, to be developed by one of the Firefox developers, btw. And it allows you to check XHTML1 syntax as well (but only if you serve it as XHTML - if you serve it as HTML, then it validates it as HTML). ; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Again: Absolutely not. 
They must not, will not and must not honour the XML prologue. (It is another matter that some user agents sometimes use the prologue to look for encoding information.) Sure they can, because this XHTML1 site violates HTML5 rules, missing its required prologue. Not sure how you understand the phrase honour the XML prologue. It also sounds as if you say that HTML5 has its own prologue. But HTML5 does not contain any code that is commonly known as a prologue. For instance, if you refer to the code !DOCTYPE html, then this is not a prologue, even if it occurs at the start of the document. Also, since there are two flavours of XML - XML 1.0 and XML 1.1 - the prologue may potentially have an effect on how the document is parsed, but only if the parser already knows that the file is XML. But the XML prologue does not *cause* parsers to choose XML mode rather than HTML mode. (Opera introduced the opposite thing some time ago: if the document is an XHTML document - for real - but contains XML well-formedness errors, then it will switch to HTML mode.) Now given the XML prolog and the DTD declaration, the file is clearly not even HTML5 in XML/XHTML (i.e. XHTML 5), but is XHTML 1 (based on a stable subset of HTML4, but working in strict mode without the quirks modes). Once again, this excludes using the HTML5 rules again. In a way the names and the numbers (HTML4, XHTML1, HTML5) are just confusing. There is just one way to parse HTML. When it comes to HTML (text/html), HTML5 differs from HTML4 and XHTML1 in that it is not based on *another* format than HTML itself. Because HTML4 and XHTML1 are not based on how HTML actually works, and - in addition - do not take full account of that (or whatever the reason), they allow syntaxes, such as DTD declarations, which have no effect (except side-effects such as quirks mode) in HTML. HTML5 admits the two syntaxes: SGML-based, as it is used primarily (in a simplified profile), and XML. From one angle, you are of course right. 
But HTML5 actually explains that what you call SGML-based is not SGML-based but only SGML *inspired*. Thus, HTML5 is much simpler and less cryptic than the (official) SGML syntax of HTML4. I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard and where processing the XML prolog is NOT an option but a requirement). Just for the record: HTML5 defines the most up-to-date parsing mechanism for *all* HTML documents - HTML 1, 2, 3, 5 as well as any flavour of XHTML served as HTML. HTML5 does not allow authors to use the XML prologue. Where? Here: http://dev.w3.org/html5/spec/syntax.html#writing (As you can see, it doesn't say that it is allowed, hence it is not.) You can also see the bottom of this page: http://dev.w3.org/html5/spec/the-meta-element.html#charset The required
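[The "Appendix C" meta element mentioned above works precisely because an ordinary HTML parser sees it, with no XML machinery involved. A small sketch using Python's standard-library HTML parser; the class name is illustrative:]

```python
from html.parser import HTMLParser

# Sketch: an HTML parser (here Python's stdlib one) picks up the
# Appendix C style encoding declaration with no XML machinery at all.
class MetaCharsetFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            content = dict(attrs).get("content", "")
            if "charset=" in content:
                self.charset = content.split("charset=")[1].strip()

page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=UTF-8" /></head><body></body></html>')
finder = MetaCharsetFinder()
finder.feed(page)
print(finder.charset)  # UTF-8
```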
Re: xkcd: LTR
2012/11/28 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no For a new version of the validator that asks more of those questions, please try http://validator.w3.org/nu/ - it happens, for the most part, to be developed by one of the Firefox developers, btw. And it allows you to check XHTML1 syntax as well (but only if you serve it as XHTML - if you serve it as HTML, then it validates it as HTML). This new validator is not the one promoted and supported. I use the Unicorn validator, which checks all W3C-supported markup languages (including HTML5). ; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Again: Absolutely not. They must not, will not and must not honour the XML prologue. (It is another matter that some user agents sometimes use the prologue to look for encoding information.) Sure they can, because this XHTML1 site violates HTML5 rules, missing its required prologue. Not sure how you understand the phrase honour the XML prologue. It also sounds as if you say that HTML5 has its own prologue. But HTML5 does not contain any code that is commonly known as a prologue. For instance, if you refer to the code !DOCTYPE html, then this is not a prologue, even if it occurs at the start of the document. A question of terminology specific to this version; I consider it part of the prolog, and it is not valid XML, so not valid XHTML. From one angle, you are of course right. But HTML5 actually explains that what you call SGML-based is not SGML-based but only SGML *inspired*. Thus, HTML5 is much simpler and less cryptic than the (official) SGML syntax of HTML4. It is evident that here I mean the legacy HTML syntax, not compatible with XML (it allows omitting closing tags, and does not require self-closing tags for empty elements). 
I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard and where processing the XML prolog is NOT an option but a requirement). Just for the record: HTML5 defines the most up-to-date parsing mechanism for *all* HTML documents - HTML 1, 2, 3, 5 as well as any flavour of XHTML served as HTML. HTML5 does not allow authors to use the XML prologue. Where? Here: http://dev.w3.org/html5/spec/syntax.html#writing (As you can see, it doesn't say that it is allowed, hence it is not.) You can also see the bottom of this page: http://dev.w3.org/html5/spec/the-meta-element.html#charset The required HTML5 prolog applies to its SGML-based syntax; Please note that prolog is one thing, and the DOCTYPE is another, see XML 1.0: http://www.w3.org/TR/REC-xml/#sec-prolog-dtd Yes, I know the terminology, but it's evident that I'm including the document declaration as part of the prolog (i.e. everything that is not a comment and that appears before the root element). it makes no sense in XHTML as it voluntarily violates the validity of the XML document declaration. If you are speaking about the HTML5 doctype, then its only effect is to make sure that the HTML parser stays in no-quirks (aka standards) mode. In XHTML, then, you are right that it is not needed. But you are wrong if you say that it is a problem to include it in XHTML, as it causes no harm. In fact, in XHTML, you can drop both the DOCTYPE and the XML prologue. The absence of the required HTML5 prolog (in its standard basic-SGML profile), or the presence of another, incompatible XML prolog, is enough to make the distinction between the two syntaxes. You mean: Visually? Yes. However, that is not how parsers think. What parsers normally do is look at the Content-Type flag before they decide how to parse the document. 
True, but then when the HTML5 parser detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, or for HTML4 or before, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception and try another parser. The XML declaration itself is enough to throw the exception, and is easy enough to detect to allow changing from an HTML parser to an XML parser for XHTML. If even the XML parser fails, then retry with a legacy HTML parser working in quirks mode. Now HTML5 is still not completely polished, finished and approved. Such interoperability rules are not clearly defined, even if they are the most up-to-date way to make it work seamlessly with the claimed compatibility with all flavors of HTML or XHTML. And the fact that Firefox and IE behave differently from Chrome and Safari in this domain is proof of this unfinished status. I would not conclude like that … But it could probably have saved us this discussion if Firefox/IE, like the other dominating browsers, did use it as a
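[The behaviour the thread attributes to Chrome and Opera - peeking at an XML declaration in a text/html byte stream as one *extra* source of encoding information - could be sketched roughly as follows. This is a heuristic illustration, not anything HTML5 requires and not those browsers' actual code:]

```python
import re

def xml_decl_encoding(data):
    """Return the encoding named in a leading XML declaration, if any.
    Purely a heuristic extra hint; HTML5 does not require this."""
    m = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
                 data)
    return m.group(1).decode("ascii") if m else None

print(xml_decl_encoding(b'<?xml version="1.0" encoding="UTF-8"?><html/>'))
# UTF-8
print(xml_decl_encoding(b'<!DOCTYPE html><html></html>'))
# None
```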
Re: xkcd: LTR
detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, or for HTML4 or before, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception and try another parser. There is no spec, that I am aware of, that says that it should do that. But this is in the scope of HTML5, whose claimed purpose is to become compatible with documents encoded in all previous flavors of HTML. Otherwise this claim is very weak, and HTML5 is just a standard compatible with itself and nothing else (it breaks XHTML rules, SGML rules for the document declaration, and IETF charset naming rules with its reinterpretation of ISO 8859-1, which is also still not stabilized). HTML5 is still beta in these claims, and it's regrettable that its required document declaration does not even specify its SGML catalog entry name, even though it forbids the insertion of a DTD. One day or another, the SGML catalog entry name at least will come back, when HTML5 has been released and a newer version is needed and developed; HTML5 should still allow the presence of this SGML catalog entry name, even if it does not require it in this version.
Re: xkcd: LTR
Philippe Verdy, Wed, 28 Nov 2012 04:23:10 +0100: 2012/11/28 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no For a new version of the validator that asks more of those questions, please try http://validator.w3.org/nu/ - it happens, for the most part, to be developed by one of the Firefox developers, btw. And it allows you to check XHTML1 syntax as well (but only if you serve it as XHTML - if you serve it as HTML, then it validates it as HTML). This new validator is not the one promoted and supported. I use the Unicorn validator, which checks all W3C-supported markup languages (including HTML5). The nu validator is good if you are interested in the questions I mentioned above. Please note that prolog is one thing, and the DOCTYPE is another, see XML 1.0: http://www.w3.org/TR/REC-xml/#sec-prolog-dtd Yes, I know the terminology, but it's evident that I'm including the document declaration as part of the prolog (i.e. everything that is not a comment and that appears before the root element). It is just as confusing as ever that you continue to insist on your terminology. The absence of the required HTML5 prolog (in its standard basic-SGML profile), or the presence of another, incompatible XML prolog, is enough to make the distinction between the two syntaxes. You mean: Visually? Yes. However, that is not how parsers think. What parsers normally do is look at the Content-Type flag before they decide how to parse the document. True, but then when the HTML5 parser The HTML5 parser is just the one and only (updated) HTML parser. detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, or for HTML4 or before, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception and try another parser. There is no spec, that I am aware of, that says that it should do that. -- leif halvard silli
Re: xkcd: LTR
Philippe Verdy, Wed, 28 Nov 2012 04:50:06 +0100: detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, for HTML4 or earlier, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception and try another parser. There is no spec, that I am aware of, that says that it should do that. But this is within the scope of HTML5, whose claimed purpose is to become compatible with documents encoded in all previous flavors of HTML. I admit that understanding the meaning behind all the slogans about HTML5 can be demanding. But the goal has all along been to create a *single* HTML parser, and not to introduce switching between multiple HTML parsers. If you think otherwise, then my claim is that you have misunderstood. Otherwise this claim is very weak and HTML5 is just a standard compatible with itself, Yes, HTML5 is a standard in itself. For instance, the issue of the XML prologue has been mentioned from time to time during the HTML5 process, but a deliberate choice was made not to accept it as part of the syntax. Probably one of the motivations for that choice was to help authors keep HTML and XHTML separate. Also, HTML5 contains some willful violations of other standards. But then, a standard is supposed to set a new standard, hence that should in principle be OK. It is true that terms such as Web compatible, and compatible in general, have been used sloganishly about HTML5. I think, in one way, it was just a method for getting things to move. But it is not the case that compatibility has trumped every other HTML5 design option - other things to consider are, for instance, that the end result - the final syntax - be simple to understand, without overly complicated and convoluted rules. Just my two cents, about how I see it. -- leif halvard silli
Re: xkcd: LTR
Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
RE: xkcd: LTR
Somewhat ironically, both Firefox and Internet Explorer, on my machine at least, detect this page as encoded in ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog, which is the only place where the encoding is stated. From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of John H. Jenkins Sent: Tuesday, 27 November 2012 1:15 AM To: Unicode Mailing List Subject: Re: xkcd: LTR Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Not a bug of your machine or browser; this is a problem of the web server's metadata. The transport layer indicates another encoding to the client in the HTTP headers, and that prevails over what the document itself declares. In this case, the web server should either transcode the source document to match what it announces in its HTTP headers, or better identify its local file contents so as to send the correct HTTP header. Send a bug report to the site admin to fix the web server settings, possibly per directory, or using a naming scheme for pages that are encoded differently: e.g. http://www.example.net/path/to/file.UTF-8.html would request the content of a file named file.UTF-8.html, whose explicit *.UTF-8.html extension the server can map to an HTTP header declaring the effective UTF-8 encoding (instead of cp-1252). My opinion, however, is that new content should always be encoded in UTF-8, and older content can be linked from a separate archive directory where it can be mapped to the older encoding without having to re-encode it. 2012/11/26 Marc Durdin marc.dur...@tavultesoft.com Somewhat ironically, both Firefox and Internet Explorer, on my machine at least, detect this page as encoded in ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog, which is the only place where the encoding is stated. From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of John H. Jenkins Sent: Tuesday, 27 November 2012 1:15 AM To: Unicode Mailing List Subject: Re: xkcd: LTR Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
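The per-extension naming scheme suggested above can be expressed directly in server configuration. A sketch, assuming an Apache httpd server with mod_mime (the directory default shown here is an assumption, not the actual site's configuration):

```apache
# Hypothetical per-directory (.htaccess) sketch: files named *.UTF-8.html
# get a UTF-8 charset in their Content-Type header, while everything else
# in the directory keeps the legacy default.
AddDefaultCharset windows-1252
AddCharset UTF-8 .UTF-8
```

With this mapping, old pages need no re-encoding; only renamed (or newly authored) UTF-8 files opt in to the correct header.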
Re: xkcd: LTR
That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
RE: xkcd: LTR
In this instance the web server is not returning an encoding (“Content-Type: text/html”), which is why I was curious to see that neither web browser picked up the UTF-8 hint in the XML prolog. Chrome does detect UTF-8 for that page. From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy Sent: Tuesday, 27 November 2012 7:49 AM To: Marc Durdin Cc: John H. Jenkins; Unicode Mailing List Subject: Re: xkcd: LTR Not a bug of your machine or browser; this is a problem of the web server's metadata. The transport layer indicates another encoding to the client in the HTTP headers, and that prevails over what the document itself declares. In this case, the web server should either transcode the source document to match what it announces in its HTTP headers, or better identify its local file contents so as to send the correct HTTP header. Send a bug report to the site admin to fix the web server settings, possibly per directory, or using a naming scheme for pages that are encoded differently: e.g. http://www.example.net/path/to/file.UTF-8.html would request the content of a file named file.UTF-8.html, whose explicit *.UTF-8.html extension the server can map to an HTTP header declaring the effective UTF-8 encoding (instead of cp-1252). My opinion, however, is that new content should always be encoded in UTF-8, and older content can be linked from a separate archive directory where it can be mapped to the older encoding without having to re-encode it. 2012/11/26 Marc Durdin marc.dur...@tavultesoft.com Somewhat ironically, both Firefox and Internet Explorer 9, on my machine at least, detect this page as encoded in ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog, which is the only place where the encoding is stated. 
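The precedence chain under discussion - the HTTP Content-Type charset wins when present; otherwise the XML prolog's encoding pseudo-attribute applies; otherwise XML defaults to UTF-8 - can be sketched as follows. The function name and regexes are illustrative assumptions, not any browser's real algorithm:

```python
# Sketch of charset precedence for an XML/XHTML document, assuming the
# transport header outranks any in-document declaration.
import re

def effective_charset(content_type: str, body: bytes) -> str:
    # 1. Transport layer wins: a charset parameter in the HTTP header.
    m = re.search(r'charset\s*=\s*"?([\w.-]+)', content_type, re.I)
    if m:
        return m.group(1).lower()
    # 2. Otherwise, an encoding pseudo-attribute in the XML prolog.
    m = re.match(rb'\s*<\?xml[^>]*encoding\s*=\s*["\']([\w.-]+)', body)
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Otherwise, XML documents default to UTF-8.
    return "utf-8"
```

On Marc's example, step 1 never fires (the server sends a bare “text/html”), so an XML-aware client should fall through to the prolog - which is exactly what Chrome appears to do and Firefox/IE do not.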
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of John H. Jenkins Sent: Tuesday, 27 November 2012 1:15 AM To: Unicode Mailing List Subject: Re: xkcd: LTR Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Also, I really don't like the Deseret font: {font-family: CMU; src: url(CMUSerif-Roman.ttf) format(truetype);} that you have inserted in your stylesheet (da.css), which is used to display the whole text content of the page, including the English Latin text at the bottom. This downloaded font is difficult to read because it is not hinted at all, so its on-screen rendering is extremely poor (we probably don't want to print each page of this XKCD series, when the main interest is the image, which is perfectly readable). Could you ask someone on this list to help you hint this font at least minimally (even basic autohinting would be much better)? 2012/11/27 Philippe Verdy verd...@wanadoo.fr Did you try adding the xml:lang=en-Dsrt pseudo-attribute to the html element, as suggested by the W3C Unicorn validator? http://validator.w3.org/unicorn/check?ucn_uri=www.xn--elqus623b.net%2FXKCD%2F1138.htmlucn_lang=frucn_task=conformance# Maybe this could help IE and Firefox, which can't figure out the language used, to properly detect the encoding if they still don't trust the XML declaration in this case, and avoid their using an encoding guesser. It is anyway curious, because this site is valid as XHTML 1.1 (not as HTML5, which uses a very different and simplified prolog that is not matched here, so the legacy rules should apply to detect XHTML here, then legacy HTML4 if XHTML is no longer recognized by IE and Firefox). Because the XHTML is properly tagged, the XML requirements should apply and the XML declaration in the prolog should be used, without needing to guess the encoding from the rest of the content (starting with a meta element in the HTML head element). 2012/11/27 John H. Jenkins jenk...@apple.com That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. 
The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Did you try adding the xml:lang=en-Dsrt pseudo-attribute to the html element, as suggested by the W3C Unicorn validator? http://validator.w3.org/unicorn/check?ucn_uri=www.xn--elqus623b.net%2FXKCD%2F1138.htmlucn_lang=frucn_task=conformance# Maybe this could help IE and Firefox, which can't figure out the language used, to properly detect the encoding if they still don't trust the XML declaration in this case, and avoid their using an encoding guesser. It is anyway curious, because this site is valid as XHTML 1.1 (not as HTML5, which uses a very different and simplified prolog that is not matched here, so the legacy rules should apply to detect XHTML here, then legacy HTML4 if XHTML is no longer recognized by IE and Firefox). Because the XHTML is properly tagged, the XML requirements should apply and the XML declaration in the prolog should be used, without needing to guess the encoding from the rest of the content (starting with a meta element in the HTML head element). 2012/11/27 John H. Jenkins jenk...@apple.com That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Anyway, you could at least list Segoe UI before your CMU font; even if Segoe UI works only on Windows, it has decent support for Deseret. Maybe there's also a good font that ships with some recent version of Mac OS, which you could list too, leaving your CMU after them only for other OSes. In all cases, I also suggest tagging only the parts that are written in Deseret with xml:lang=en-Dsrt, so that you can have a CSS selector to match the Deseret fonts. For the rest, just use your choice of Lucida, Arial, sans-serif in less selective CSS selectors (that don't care about the language tags). The template design of these pages is simple enough that you can do it with just a few modifications. 2012/11/27 Philippe Verdy verd...@wanadoo.fr Also, I really don't like the Deseret font: {font-family: CMU; src: url(CMUSerif-Roman.ttf) format(truetype);} that you have inserted in your stylesheet (da.css), which is used to display the whole text content of the page, including the English Latin text at the bottom. This downloaded font is difficult to read because it is not hinted at all, so its on-screen rendering is extremely poor (we probably don't want to print each page of this XKCD series, when the main interest is the image, which is perfectly readable). Could you ask someone on this list to help you hint this font at least minimally (even basic autohinting would be much better)? 2012/11/27 Philippe Verdy verd...@wanadoo.fr Did you try adding the xml:lang=en-Dsrt pseudo-attribute to the html element, as suggested by the W3C Unicorn validator? http://validator.w3.org/unicorn/check?ucn_uri=www.xn--elqus623b.net%2FXKCD%2F1138.htmlucn_lang=frucn_task=conformance# Maybe this could help IE and Firefox, which can't figure out the language used, to properly detect the encoding if they still don't trust the XML declaration in this case, and avoid their using an encoding guesser. 
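The suggestion above - scope the downloaded font to the Deseret-tagged parts via a language selector, and leave Latin text on ordinary system fonts - could look like this in the stylesheet. This is a sketch, not the actual contents of da.css; the fallback lists are assumptions:

```css
/* Keep the downloadable font available, but stop applying it globally. */
@font-face {
  font-family: "CMU";
  src: url("CMUSerif-Roman.ttf") format("truetype");
}

/* Only elements tagged with a Deseret language tag use Deseret-capable
   fonts; Segoe UI first (Windows), downloaded CMU as the fallback. */
:lang(en-Dsrt) {
  font-family: "Segoe UI", "CMU", serif;
}

/* Everything else (the English Latin text) uses normal system fonts. */
body {
  font-family: "Lucida Grande", Arial, sans-serif;
}
```

With the page served as XHTML, the xml:lang attribute drives the :lang() selector, so only the comic transcription picks up the unhinted font.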
It is anyway curious, because this site is valid as XHTML 1.1 (not as HTML5, which uses a very different and simplified prolog that is not matched here, so the legacy rules should apply to detect XHTML here, then legacy HTML4 if XHTML is no longer recognized by IE and Firefox). Because the XHTML is properly tagged, the XML requirements should apply and the XML declaration in the prolog should be used, without needing to guess the encoding from the rest of the content (starting with a meta element in the HTML head element). 2012/11/27 John H. Jenkins jenk...@apple.com That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
xkcd: LTR
http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie