Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
So we would be in a case where it's impossible to guarantee full compatibility or interoperability between the two concurrent standards from the same standards body, while promising the best interoperability with past flavors of HTML (those past flavors are not really in the past, given that two of them are not deprecated but still fully recommended, and HTML5 still has draft status). HTML5 would contradict everyone else, and only HTML5.

But I still think that the discriminant factor for HTML5 is its exclusive (and **mandatory**) document declaration: if it is absent for any reason, there's absolutely NO reason to continue using an HTML5 parser, and browsers must then either:
- fall back to another legacy parser, or
- use an HTML5 parser (if that is the only one available) working in a more lenient mode, to recognize at least the XML prolog and the legacy SGML document declarations for HTML or XHTML, and at least recognize the encoding in the XML prolog when it is present.

This second option (more lenient parsing by the HTML5 parser) should be documented and become part of this future standard (still not finalized).

In my opinion the XML parser is definitely not a legacy parser; it is present in all browsers for lots of services and applications. It is even needed to support HTML5 in its XHTML serialization syntax (which is explicitly supported).

For me, it is normal that the Unicorn validator does not integrate HTML5, given its draft status. So there's still a separate validator (also working in a beta version, given the draft status of HTML5) which is not yet integrable into Unicorn.
But given the huge developments already made on the web with HTML5, it becomes urgent to fix these interoperability issues before the final release of HTML5: the existing major browsers are already modified constantly to follow the state of this draft, so it will not be difficult for them to implement the missing interoperability rules, and the sooner it is done, the sooner web designers will be guided. (And in that case, the beta nu validator of the W3C could start being integrated into Unicorn, which remains the best validator of them all; nu cannot be trusted for now, and it does not even return a conformance logo in its results, given that conformance rules are still not fully tested and specified in HTML5.) HTML5 remains an important project, but it is still not a standard by itself.

2012/11/28 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no

Philippe Verdy, Wed, 28 Nov 2012 11:02:45 +0100: In this case, Firefox and IE should not even be able to render *any* XHTML page because it violates the HTML5 standard.

(1) The page in question (http://www.xn--elqus623b.net/XKCD/1137.html) is (from a source code point of view) a pure XHTML page, and contains no HTML-compatible method for declaring the encoding. And therefore, that page does indeed violate the HTML5 standard, with the result that browsers are permitted to fall back to their built-in default encodings.

(2) According to XML, the XML prologue can be deleted for UTF-8 encoded pages. And when it is deleted/omitted, XML parsers assume that the page is UTF-8 encoded.

(3) And if you try that (that is: if you *do* delete the XML prologue from that page), then you will see that the Unicorn validator will *continue* to stamp that Web page as error free. This is because the Unicorn validator only considers the rules of XML - it doesn't consider the rules of HTML.

(4) Also, when you do delete the XML prologue, then not only Firefox and IE but even Safari will render the page in the wrong encoding. However, Opera and Chrome will continue to render the page as UTF-8 due to the UTF-8 sniffing that they cleverly have built in. Clearly, Opera and Chrome's behaviour is the way to go.

(5) It is indeed backwards that the W3C Unicorn validator doesn't inform its users when their pages fail to include an HTML-compatible method for declaring the encoding. This suboptimal validation could partly be related to libxml2, which Unicorn is partly based on. Because - as it turns out - the command line tool xmllint (which is part of libxml2) shows a very similar behaviour to that of Unicorn: it pays no respect to the fact that the MIME type (or Content-Type:) is 'text/html' and not an XML MIME type. In fact, when you do delete the XML prologue, Unicorn issues this warning (you must click to make it visible): No Character Encoding Found! Falling back to UTF-8. Which is a quite confusing message to send, given that HTML parsers do not, as their last resort, fall back to UTF-8.

-- leif halvard silli
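The lenient recovery mode proposed above (recognize at least the encoding pseudo-attribute of a leading XML prolog before falling back to a default) can be sketched in a few lines. This is a hypothetical illustration, not code from any browser or validator; the function name and the windows-1252 fallback are assumptions made for the sketch.

```python
import re

# Hypothetical pre-scan: recognize the encoding pseudo-attribute of an
# XML prolog, as proposed in the message above. The windows-1252 default
# stands in for a browser's locale-dependent legacy fallback.
XML_PROLOG = re.compile(rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def sniff_prolog_encoding(head: bytes, default: str = "windows-1252") -> str:
    """Return the encoding named in a leading XML prolog, else `default`."""
    m = XML_PROLOG.match(head)
    return m.group(1).decode("ascii") if m else default
```

For example, `sniff_prolog_encoding(b'<?xml version="1.0" encoding="utf-8"?>...')` yields `"utf-8"`, while a document starting with `<!DOCTYPE html>` falls through to the default.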
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 10:11:13 +0100: So we would be in a case where it's impossible to guarantee full compatibility or interoperability between the two concurrent standards from the same standards body, while promising the best interoperability with past flavors of HTML (those past flavors are not really in the past, given that two of them are not deprecated but still fully recommended, and HTML5 still has draft status).

Section 5.1 of XHTML 1.0 says: [1] 'XHTML Documents which follow the guidelines set forth in Appendix C, HTML Compatibility Guidelines may be labeled with the Internet Media Type text/html'

And Appendix C, point 9 of XHTML 1.0 says: [2] 'the best approach is to ensure that the web server provides the correct headers. If this is not possible, a document that wants to set its character encoding explicitly must include [ snip ] a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />).'

For me, it is normal that the Unicorn validator does not integrate HTML5, given its draft status.

The strange thing is that Unicorn doesn't integrate XHTML1. [1][2]

[1] http://www.w3.org/TR/xhtml1/#media
[2] http://www.w3.org/TR/xhtml1/#C_9

-- leif halvard silli
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
You're wrong. XHTML1 is integrated in the W3C validator and recognized automatically. The document you cite in the XHTML1 specs has just not been updated. http://validator.w3.org/check?uri=http%3A%2F%2Fwww.xn--elqus623b.net%2FXKCD%2F1137.html&charset=%28detect+automatically%29&doctype=Inline&group=0

Anyway this http://www.xn--elqus623b.net/XKCD/1137.html site is actually using XHTML 1.1 (in its strict schema, not a transitional schema).

2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 13:26:28 +0100: You're wrong. XHTML1 is integrated in the W3C validator and recognized automatically.

Indeed, yes. What I meant by 'doesn't integrate XHTML1' was that Unicorn doesn't 100% adhere to the two sections of XHTML1 that I quoted.[1][2]

The document you cite in the XHTML1 specs has just not been updated.

The validator must of course implement what XHTML1 says.

Anyway this http://www.xn--elqus623b.net/XKCD/1137.html site is actually using XHTML1.1 (in its strict schema, not a transitional schema)

A relevant point, of course. But XHTML11 says the same thing: [3] 'XHTML 1.1 documents SHOULD be labeled with the Internet Media Type application/xhtml+xml as defined in [RFC3236]. For further information on using media types with XHTML, see the informative note [XHTMLMIME].'

The XHTMLMIME note says: [4] 'The 'text/html' media type [RFC2854] is primarily for HTML, not for XHTML. In general, this media type is NOT suitable for XHTML except when the XHTML conforms to the guidelines in Appendix A.' [5] 'DO set the encoding via a meta http-equiv statement in the document (e.g., <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />)'

[1] http://www.w3.org/TR/xhtml1/#media
[2] http://www.w3.org/TR/xhtml1/#C_9
[3] http://www.w3.org/TR/xhtml11/xhtml11.html#strict
[4] http://www.w3.org/TR/xhtml-media-types/#text-html
[5] http://www.w3.org/TR/xhtml-media-types/#C_9

-- leif halvard silli
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
And you forget the important part of Appendix A: *Consequence*: Remember, however, that when the XML declaration is not included in a document, AND the character encoding is not specified by a higher level protocol such as HTTP, the document can only use the default character encodings UTF-8 or UTF-16. See, however, guideline 9 (http://www.w3.org/TR/xhtml-media-types/#C_9) below.

Here we have an XHTML site that is already encoded with the default UTF-8. There's no reason then for Firefox or IE to render it with windows-1252, even if they ignore the XML prolog. The text/html content-type remains appropriate for XHTML 1.0, 1.1 or 5.0. The other content-type is application/xhtml+xml and similar types for integrating other XML schemas, but it is only appropriate if you need another schema than just XHTML, or if you want to integrate support for an external or internal non-standard DTD, or support for XML processing instructions (including XML schemas not used here, or XML stylesheets, which is the case here for rendering the technical code when viewing the source, but not for rendering the described page content itself).

The problem here is guideline 9, which is not part of the standard, and which uses one of the worst parts of HTML, meta elements; this was partly ill-designed as an empty element, and it binds the content-type in a way that overrides it and forces reparsing from the start, after parsing all or part of the other required elements (html, head, title, body). But why? Isn't UTF-8 (or alternatively UTF-16) already the default encoding of XHTML? If not, then we should file a bug against the W3C Validator for not honoring guideline 9 (even though it is not part of the standard itself, but just a recommendation, it should issue at least a warning).

2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 14:24:29 +0100: And you forget the important part of Appendix A: *Consequence*: Remember, however, that when the XML declaration is not included in a document, AND the character encoding is not specified by a higher level protocol such as HTTP, the document can only use the default character encodings UTF-8 or UTF-16. See, however, guideline 9 (http://www.w3.org/TR/xhtml-media-types/#C_9) below. Here we have an XHTML site that is already encoded with the default UTF-8. There's no reason then for Firefox or IE to render it with windows-1252, even if they ignore the XML prolog. The text/html content-type remains appropriate for XHTML 1.0, 1.1 or 5.0.

Note that point 1, which you quoted,[1] and all the rest of the entire note, is about how *authors* should behave when they create XHTML documents. The note is *not* about how user agents should behave. Also note that what you refer to as the important part of Appendix A ends in a sentence that points to guideline 9, which in turn tells authors to 'DO set the encoding via a meta http-equiv', and note that the example in guideline 9 uses UTF-8 as example, quote: '(e.g., <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />)'.

... But why? Isn't UTF-8 (or alternatively UTF-16) already the default encoding of XHTML? If not, then we should file a bug against the W3C Validator for not honoring guideline 9 (even though it is not part of the standard itself, but just a recommendation, it should issue at least a warning).

This is exactly the problem. Your 'if not' does apply! Because, if one presents an XHTML document to the browser as HTML, then windows-1252 - and not UTF-8 - becomes the default encoding. And, in fact, as a consequence of our dialog, I have notified the developers of Unicorn about the shortcoming, asking them to issue a warning.

[1] http://www.w3.org/TR/xhtml-media-types/#C_1
[2] http://www.w3.org/TR/xhtml-media-types/#C_9

-- leif halvard silli
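The crux of this exchange, that the fallback encoding depends on how the document is presented rather than on what it contains, can be made concrete with a small sketch. This is illustrative only: real browsers use a locale-dependent legacy default rather than a hard-coded windows-1252, and the function below is a hypothetical helper, not part of any browser or validator.

```python
def default_encoding(content_type: str) -> str:
    """Fallback encoding when nothing declares one, per the discussion
    above: XML media types default to UTF-8 (per the XML spec), while
    text/html falls back to a legacy default. windows-1252 stands in for
    that legacy default (an assumption; the real value varies by locale)."""
    mime = content_type.split(";")[0].strip().lower()
    if mime.endswith("+xml") or mime in ("text/xml", "application/xml"):
        return "utf-8"
    return "windows-1252"
```

So the same XHTML bytes get UTF-8 when served as `application/xhtml+xml` but windows-1252 when served as `text/html`, which is exactly the trap the page in question fell into.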
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no

Philippe Verdy, Thu, 29 Nov 2012 14:24:29 +0100: ... But why? Isn't UTF-8 (or alternatively UTF-16) already the default encoding of XHTML? If not, then we should file a bug against the W3C Validator for not honoring guideline 9 (even though it is not part of the standard itself, but just a recommendation, it should issue at least a warning).

This is exactly the problem. Your 'if not' does apply! Because, if one presents an XHTML document to the browser as HTML, then windows-1252 - and not UTF-8 - becomes the default encoding. And, in fact, as a consequence of our dialog, I have notified the developers of Unicorn about the shortcoming, asking them to issue a warning.

Thanks a lot, this was really hard to see and understand, because I was only reading the XHTML specs, and the Validator did not complain.

As a side note, the Unicorn validator, which senses the content-type (in its simple interface), will still sense an XHTML content which remains valid by itself. The issue arises only when it is presented as HTML, and the validator should allow seeing the effect of using HTML parsers (HTML4 or HTML5) on XHTML documents, by offering a way to select another document type than the autodetected one (XHTML here) whenever the warning is displayed. Because the XHTML document may not validate at all when parsed as HTML: it will first get warnings about the presence of the XML prolog (which is generally not a problem, as it is typically ignored in browsers), but an error about XML processing instructions (I don't think that the optional leading XML declaration is a processing instruction), or an error about a non-conforming document declaration (according to the selected HTML flavor: HTML4 or HTML5).

Anyway, we can expect this page design error to be frequent, and HTML5 parsers had still better not discard the XML declaration, but at least recognize its encoding pseudo-attribute (even if processing continues using HTML rules and not XML rules), instead of relying on the presence of the meta element, which is really ugly and forces reparsing using the detected encoding instead of the default windows-1252 (this is unnecessarily slow). That would make guideline 9 applicable only to flavors of HTML before HTML5, once it is released; in that case the warning issued by the Validator would apply only to versions before HTML5, but not to HTML5. This would increase the compatibility of HTML5 parsers with valid XHTML1 and XHTML5 documents simply created or modified by XML or XHTML editors.
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
In my opinion, for HTML5 (and not XHTML5) there should also exist a leading prolog like:

<?html version="5.0" encoding="utf-8"?>

For XHTML5, we would continue using the XML prolog; but it *may* be followed by the html prolog, without needing to repeat the optional encoding pseudo-attribute, which XML parsers would treat as a processing instruction:

<?xml version="1.0" encoding="utf-8"?>
<?html version="5.0"?>

In the absence of these prologs, each parser would use its default encoding. Autosensing of document types would remain possible, and HTML5 would also no longer be dependent on transport protocols or on the very ugly <meta http-equiv="Content-type" content="text/html;charset=utf-8"> element which forces reparsing. The pseudo-DOCTYPE tentatively introduced in HTML5, which breaks in SGML parsers and in past HTML parsers, should be eliminated from HTML5 if the HTML prolog is present (the HTML prolog would be highly preferred, including for its useful versioning).
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 16:10:14 +0100: Thanks a lot, this was really hard to see and understand, because I was only reading the XHTML specs, and the Validator did not complain.

Glad to find we are on the same page!

Philippe Verdy, Thu, 29 Nov 2012 16:27:13 +0100: <?html version="5.0" encoding="utf-8"?>

HTML5 already has 4 *conforming* methods for setting the UTF-8 encoding:

1. byte-order mark
2. HTTP server, Content-Type: text/html;charset=UTF-8
3. meta http-equiv, <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"/>
4. meta charset, <meta charset="UTF-8"/> (Note that there is no content-type here, and thus the meta charset method is cleaner to use in a file served as XHTML.)

In addition, other things have effect:

6. Sniffing is an official, but largely unimplemented method for getting the encoding (Chrome and Opera use it, and Firefox has it as an option and also uses it by default for some locales.)
7. The XML prologue (sic) takes effect in *some* browsers.
8. Simply serving the page as application/xhtml+xml is yet another method of setting the encoding to UTF-8.

Thus I can guarantee you that your idea about a method number 9 is not going to be met with enthusiasm.

-- leif halvard silli
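The four conforming mechanisms in the list above can be checked mechanically. The following sketch (a hypothetical helper, not part of any validator; the function name and regexes are assumptions for illustration) reports which of them declare UTF-8 for a given document head and HTTP Content-Type header:

```python
import re

def declared_utf8_methods(head_bytes: bytes, http_content_type: str = "") -> list:
    """Report which of the four conforming mechanisms listed above
    declare UTF-8. Simplified for illustration: real parsers tokenize
    rather than use regexes, and bound how far they scan."""
    found = []
    if head_bytes.startswith(b"\xef\xbb\xbf"):              # 1. byte-order mark
        found.append("BOM")
    if "charset=utf-8" in http_content_type.lower().replace(" ", ""):
        found.append("HTTP Content-Type")                   # 2. HTTP header
    head = head_bytes.decode("ascii", errors="replace").lower()
    if re.search(r'<meta\s+http-equiv=["\']?content-type["\']?[^>]*charset=utf-8', head):
        found.append("meta http-equiv")                     # 3. meta http-equiv
    if re.search(r'<meta\s+charset=["\']?utf-8', head):
        found.append("meta charset")                        # 4. meta charset
    return found
```

A page like the XKCD mirror under discussion, with only an XML prolog, returns an empty list from this check, which is precisely why it falls back to the browser default when served as text/html.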
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
- Method 1 (the BOM) is only good for UTF-16; it is not reliable for UTF-8, which is still the default for XHTML (and where the BOM is not always present).
- Method 2 works sometimes, but is not practical for many servers that you can't configure to change their content-type for specific pages all having the same *.html extension, or that are relayed by some proxies; it also depends on the transport layer (HTTP here) being capable of offering it (HTML files in file systems do not provide the info). But if it is implemented it will take precedence, possibly indicating that the document was reencoded (by a proxy for example).
- Methods 3 and 4 are completely equivalent and share the same problem: they require restarting the parsing. They are equally ugly (just like all empty meta elements in the HTML header or in the body). Introducing another attribute on the meta element (which already has name, http-equiv, and now charset) is also a bad idea (data encoded in attributes that are part of the document root breaks the concept of what is metadata); it also forbids reencoding the document during processing if the document is digitally signed for its content independently of its encoding: to check the document signature, you would not only have to parse it completely up to the DOM level, but also ignore these specific meta elements (but not all meta elements, such as links).
- Method 5 is where?
- Method 6 (sniffing) is a transitory solution (as long as HTML5 is not released), or a last-chance palliative solution based only on a heuristic, which sometimes fails. Not reliable.
- Method 7 (using the XML prolog) is excellent for XML. It will reliably work with XHTML5, without needing reparsing.
- Method 8 (content-type set to application/xhtml+xml in the transport layer) is exactly like method 2 (and suffers the same problem), but this content-type is not really intended for HTML5, not even XHTML5, as it implies an application and an extensible schema that XHTML5 will not parse. Method 8 for me implies the forced use of an XML parser, not an HTML parser. All XML extensions (including namespaces) will be valid.

My method is a generalisation to HTML of the excellent method 7 for XHTML (based on its standard and the XML standard). It requires absolutely no reparsing, and supports explicit versioning of HTML (for future evolutions of its supported schema), without overriding the independent versioning of XML if it is used. As well, it does not require the new ugly DOCTYPE, which indicates absolutely nothing significant, will not allow versioning, and breaks SGML parsers as well as XML parsers. It takes benefit of the fact that prologs don't break browsers in method 7 (even if some of them do not sniff even the encoding from the XML prolog).

2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no
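The reparsing cost objected to above for methods 3 and 4 can be illustrated with a deliberately naive two-pass sketch. This is hypothetical: real HTML5 parsers run a bounded prescan over the first bytes rather than decoding the whole document twice, and the names below are invented for the example.

```python
import re

# Naive model of the restart discussed above: decode with the default
# encoding, and if a <meta> names a different charset, throw that work
# away and re-decode from the beginning. Not a browser implementation.
META_CHARSET = re.compile(r'<meta\s[^>]*charset\s*=\s*["\']?([a-zA-Z0-9_-]+)', re.I)

def parse_with_restart(raw: bytes, default: str = "windows-1252"):
    """Return (decoded_text, restarted). `restarted` is True when a
    late meta charset forced re-decoding the entire byte stream."""
    text = raw.decode(default, errors="replace")          # first pass
    m = META_CHARSET.search(text)
    if m and m.group(1).lower() != default:
        return raw.decode(m.group(1), errors="replace"), True   # second pass
    return text, False
```

The wasted first pass is the inefficiency being criticized; a declaration in the very first bytes (HTTP header, BOM, or a prolog) avoids it entirely.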
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Thu, 29 Nov 2012 19:11:42 +0100: 2012/11/29 Leif Halvard Silli: Philippe Verdy, Thu, 29 Nov 2012 16:27:13 +0100: <?html version="5.0" encoding="utf-8"?>

Thus I can guarantee you that your idea about a method number 9 is not going to be met with enthusiasm.

- Method 5 is where?

Sorry. So your method is just method number 8, then.

My method is a generalisation to HTML of the excellent method 7

I have given you my verdict. This topic is over for my part. Thanks for the exchange!

-- leif halvard silli
Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)
Note that I challenge the term 'conforming' that you use, given that HTML5 is still not released, so its conformance is still not formally defined. The nu validator is still explicitly marked by the W3C as experimental.

2012/11/29 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no HTML5 already has 4 *conforming* methods for setting the UTF-8 encoding:
Re: xkcd: LTR
2012/11/28 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no

Philippe Verdy, Wed, 28 Nov 2012 04:50:06 +0100: detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, or for HTML4 or before, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception to try another parser.

There is no spec, that I am aware of, that says that it should do that.

But this is in the scope of HTML5, whose claimed purpose is to become compatible with documents encoded in all previous flavors of HTML.

I admit that understanding the meaning behind all the slogans about HTML5 can be demanding. But the goal has all the time been to create a *single* HTML parser, and not to introduce switching between multiple HTML parsers. If you think otherwise, then my claim is that you have misunderstood.

In this case, Firefox and IE should not even be able to render *any* XHTML page because it violates the HTML5 standard.

It still attempts to recover from it, recognizing a part of XHTML, but not an essential one: its very basic XML prolog (and the XHTML document declaration), up to the point where they start seeing the root element. But then how can they claim to support XHTML when they don't (and when the XHTML syntax is still part of HTML5, which makes affirmations like yours - not to introduce switching between multiple HTML parsers - very weak and difficult to defend)?

If the intent is to be able to parse all flavors of HTML (at least a basic profile of them) with the same parser, then a behavior must be standardized in HTML5 to correctly handle the possible presence of XML prologs and standard SGML document declarations, even if their contents are skipped and ignored (notably here when they are used to specify that the document is effectively encoded in UTF-8 and not cp1252, both encodings being supported by HTML5, but without other compatibility problems when it is UTF-8).

But ignoring XHTML document declarations will have an impact on compatibility if there's an external or internal DTD, and this should be documented in HTML5 by limiting the claims of compatibility (and suggesting then another recovery mechanism for these unsupported parts of XHTML, using a true XML parser in case of violation of the required HTML5 DOCTYPE declaration). For now this breaks interoperability with the basic profile of XHTML, more or less compatible with HTML4 including the deprecated elements (but without the modular extension design, and without support of XML namespaces).

Now the argument saying that meta elements may be used in the HTML document header (to replace missing HTTP MIME headers) contradicts all that was done to deprecate this meta element usage before. And here also, HTML5 is not clear about this change of position. And meta elements won't make HTML parsers simpler to implement: they will need to reparse the document from the beginning.

The XML prolog of XHTML is much simpler to parse than the meta element, and can be parsed directly by the HTML5-only parser, which could as well accept at least the XHTML1 document declaration (without internal DTD). It should fail, however, if there's an internal DTD, or if the SGML catalog name is not one of those for HTML or XHTML; it should just check the SGML catalog name, partly ignoring the flavor precision in its name, as there is no internal or external DTD supported in HTML4 or lower; and it should silently ignore the URL of an external DTD in XHTML, including when XHTML is used as the alternate serialization syntax for HTML5, even if this causes some entities defined in the external DTD not to be replaced (the result of the HTML5 parser with undefined entities, or entities defined differently, will be unpredictable).

If an implementation can support both parsers, the more compatible recovery mode will be to use the XML parser instead of this simple heuristic. Browsers already support multiple text-encoded document parsers alongside HTML5 (JavaScript, JSON, CSS, SVG/XML, P3P, URI...), plus binary parsers for various media codecs (PNG, GIF, JPEG, WAV, ICO, OpenPDF...) if they embed them instead of using OS-supported codecs or plugins (MPEG, Ogg...), and data codecs (compressors, encryptors, archive formats... for transport and security protocol layers referenced in URI schemes). What else? In all popular browsers, the XML parser is still present, and has been for a long time, to support XML requests (and lots of GUI or configuration features, such as XUL in Firefox, VML in IE, external SVG images, local DB stores, support libraries for third-party addons...), even if JSON requests are highly preferred now: sometimes more secure, but much simpler and faster to parse (and more compact in their serialization).
UTF-8 isn't the default for HTML (was: xkcd: LTR)
Philippe Verdy, Wed, 28 Nov 2012 11:02:45 +0100: In this case, Firefox and IE should not even be able to render *any* XHTML page because it violates the HTML5 standard.

(1) The page in question (http://www.xn--elqus623b.net/XKCD/1137.html) is (from a source code point of view) a pure XHTML page, and contains no HTML-compatible method for declaring the encoding. And therefore, that page does indeed violate the HTML5 standard, with the result that browsers are permitted to fall back to their built-in default encodings.

(2) According to XML, the XML prologue can be deleted for UTF-8 encoded pages. And when it is deleted/omitted, XML parsers assume that the page is UTF-8 encoded.

(3) And if you try that (that is: if you *do* delete the XML prologue from that page), then you will see that the Unicorn validator will *continue* to stamp that Web page as error free. This is because the Unicorn validator only considers the rules of XML - it doesn't consider the rules of HTML.

(4) Also, when you do delete the XML prologue, then not only Firefox and IE but even Safari will render the page in the wrong encoding. However, Opera and Chrome will continue to render the page as UTF-8 due to the UTF-8 sniffing that they cleverly have built in. Clearly, Opera and Chrome's behaviour is the way to go.

(5) It is indeed backwards that the W3C Unicorn validator doesn't inform its users when their pages fail to include an HTML-compatible method for declaring the encoding. This suboptimal validation could partly be related to libxml2, which Unicorn is partly based on. Because - as it turns out - the command line tool xmllint (which is part of libxml2) shows a very similar behaviour to that of Unicorn: it pays no respect to the fact that the MIME type (or Content-Type:) is 'text/html' and not an XML MIME type. In fact, when you do delete the XML prologue, Unicorn issues this warning (you must click to make it visible): No Character Encoding Found! Falling back to UTF-8. Which is a quite confusing message to send, given that HTML parsers do not, as their last resort, fall back to UTF-8.

-- leif halvard silli
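The UTF-8 sniffing credited above to Opera and Chrome exploits a statistical fact: byte streams in legacy single-byte encodings almost never happen to form valid multi-byte UTF-8 sequences. A naive sketch of such a check follows (illustrative only; real browser heuristics are more elaborate and scan a bounded prefix, and the function name is invented here):

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True when the bytes decode cleanly as UTF-8 *and* contain
    at least one non-ASCII sequence (pure ASCII is inconclusive, since
    it is valid in nearly every encoding). Naive illustration of
    encoding sniffing; not browser code."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in raw)
```

This is why sniffing rescued the prolog-less page in Opera and Chrome: its non-ASCII bytes validated as UTF-8, which a windows-1252 document would almost never do by accident.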
Re: xkcd: LTR
On 11/26/2012 08:42 PM, Marc Durdin wrote: Somewhat ironically, both Firefox and Internet Explorer, on my machine at least, detect this page is encoded with ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog which is the only place where the encoding is stated. Firefox follows the HTML5 spec and ignores the XML prolog, since the Content-type is text/html.
Re: xkcd: LTR
Simon, There's no sign of HTML5 on that page. The head of the file matches all XHTML 1.1 requirements and passes all checks on validator.w3.org. Now, why would Firefox follow anything from the HTML5 spec here? -Behnam On Tue, Nov 27, 2012 at 3:37 AM, Simon Montagu smont...@smontagu.org wrote: On 11/26/2012 08:42 PM, Marc Durdin wrote: Somewhat ironically, both Firefox and Internet Explorer, on my machine at least, detect this page as encoded with ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog, which is the only place where the encoding is stated. Firefox follows the HTML5 spec and ignores the XML prolog, since the Content-Type is text/html. -- Behnam Esfahbod | بهنام اسفهبد http://behnam.es/ http://zwnj.behnam.es/ GPG Fingerprint: 3E7F B4B6 6F4C A8AB 9BB9 7520 5701 CA40 259E 0F8B
Re: xkcd: LTR
On 11/27/2012 11:19 AM, Behnam Esfahbod ZWNJ wrote: Simon, There's no sign of HTML5 on that page. The head of the file matches all XHTML 1.1 requirements and passes all checks on validator.w3.org. Now, why would Firefox follow anything from the HTML5 spec here? As I already said, because of the Content-Type HTTP header.
Re: xkcd: LTR
HTML5 does not reference the Content-Type: text/html header as enough to qualify as meaning HTML5. HTML5 **requires** its own prolog (i.e. its basic document declaration **within** the document itself, for the HTML syntax, or its FULL document declaration for the XML/XHTML syntax). So Firefox is wrong when it attempts to use HTML5 to render all HTML dialects. 2012/11/27 Simon Montagu smont...@smontagu.org On 11/27/2012 11:19 AM, Behnam Esfahbod ZWNJ wrote: Simon, There's no sign of HTML5 on that page. The head of the file matches all XHTML 1.1 requirements and passes all checks on validator.w3.org. Now, why would Firefox follow anything from the HTML5 spec here? As I already said, because of the Content-Type HTTP header.
Re: xkcd: LTR
I've never said that user agents had to 'write' the prolog. It's the reverse: yes, authors have to write a prolog (but the prolog is perfect here, so this is not the fault of the author). Why do authors have to write this prolog? Exactly because user agents will have to read it (not write it), as it is expected for validating that this is effectively HTML5 content (the Content-Type: text/html is clearly not enough; it is exactly the same as HTML4 or all past versions of HTML, working in quirks mode or not). By your assertion, all HTML5 browsers would then need to parse HTML4 as if it were HTML5, using its strict definitions that are not compatible with HTML4 (even if we ignore the quirks mode), or all past versions. HTML5 parsing is triggered by the presence of the required HTML5 prolog.
Re: xkcd: LTR
Also, you conflate this with the requirement that HTML5 parsers must be able to parse HTML4. This is true, but it does not mean that they will be able to render it fully. HTML5 is not fully upward compatible with past versions (the identification of encodings is one example where it differs, and many requirements of HTML4 are no longer requirements in HTML5, due to some relaxed rules after the failed effort to standardize HTML4 more like XHTML and according to the initial CSS specifications). So HTML5 renderers will just render HTML4 in a best effort, but lots of requirements that are applicable to real *HTML5* documents (identified by their prolog) do NOT apply to non-HTML5 documents, as they are not directly in scope of its standard (the HTML4 specifications themselves are not dismissed): the best effort implies flexibility, even if interoperability is not guaranteed across HTML5 implementations, which will all parse HTML4 documents but may still produce different results (including with the support of HTML4 quirks mode if they want). 2012/11/27 Masatoshi Kimura vyv03...@nifty.ne.jp (2012/11/27 20:27), Philippe Verdy wrote: HTML5 does not reference the Content-Type: text/html header as enough to qualify as meaning HTML5. HTML5 user agents must parse any byte sequence as an HTML5 document if the Content-Type is text/html. HTML5 **requires** its own prolog (i.e. its basic document declaration **within** the document itself, for the HTML syntax, or its FULL document declaration for the XML/XHTML syntax). HTML5 requires **authors** to write the prolog, not user agents. A lacking prolog just turns the user agent to quirks mode. Note that quirks mode doesn't mean do whatever you consider quirky. Parsing a quirks-mode document is also completely spec'ed. So Firefox is wrong and attempts to use HTML5 to render all HTML dialects. No, not at all. Rather, it is required by the spec to use the HTML5 parser to parse all byte sequences sent with Content-Type: text/html. 
Could you please stop spreading an unfounded rumor such as Firefox is wrong because it ignores the lacking of HTML5 prolog? -- vyv03...@nifty.ne.jp
Re: xkcd: LTR
Philippe Verdy, Tue, 27 Nov 2012 15:39:43 +0100: I've never said that user agents had to 'write' the prolog. It's the reverse: yes authors have to write a prolog (but the prolog is perfect here so this is not the fault of the author). XML has (or more correctly: can have) a prolog. HTML does not have a prolog. Now to the million dollar question: is your page in question XML or HTML? Answer: Per the Content-Type, it is HTML (that is: text/html). Next question: Does the XML prolog have any effect when the XML file (more specifically: the XHTML file) is served as HTML (that is: text/html)? The answer is that, per HTML5, it does not have effect. And of course, per HTML4, it does not have effect. As for XHTML 1, it cannot really regulate what is supposed to happen in text/html. The problem/challenge, however, is that some Web browsers - such as W3m (a text browser), Chrome, Opera and Safari - *do* look at the prolog for encoding info *also* when served as HTML. But Firefox and Internet Explorer do not. Which is according to the HTML5 specification. My guess is that it will *never* become conforming to use the XML prologue in HTML files. However, that does not necessarily prevent Firefox from looking at the prologue for encoding info, when *that* is the only source of encoding info. In fact, I think the HTML5 encoding sniffing algorithm already permits this (since it has a step which roughly says: if the user agent has other sources of information). So, for what it is worth - and with reference to your pages, I filed a bug against Firefox, to make it start to use the encoding declaration of the XML prologue when nothing else is available: https://bugzilla.mozilla.org/show_bug.cgi?id=815279 -- leif halvard silli
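[The encoding sniffing algorithm Leif refers to includes a "prescan" of the first bytes of the document for a meta declaration. A deliberately simplified sketch of that idea; the real algorithm is a byte-level state machine, not a regex, so this only illustrates the concept:]

```python
import re

def prescan_meta_charset(head):
    """Look for a meta charset declaration in the first 1024 bytes.
    Grossly simplified compared to the real HTML5 prescan."""
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)',
                  head[:1024], re.IGNORECASE)
    return m.group(1).decode("ascii").lower() if m else None

print(prescan_meta_charset(b'<meta charset="UTF-8">'))
# utf-8
print(prescan_meta_charset(
    b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'))
# utf-8
```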
Re: xkcd: LTR
Looks OK here, but that is probably FreeType doing its magic as usual. Regards, Khaled On Tue, Nov 27, 2012 at 02:29:45AM +0100, Philippe Verdy wrote: Also I really don't like the Deseret font: {font-family: CMU; src: url(CMUSerif-Roman.ttf) format(truetype);} that you have inserted in your stylesheet (da.css), which is used to display the whole text content of the page, including the English Latin text at the bottom. This downloaded font is difficult to read because it is not hinted at all (so its rendering on screen is extremely poor; we probably don't want to print each page of this XKCD series, when the main interest is the image, which is perfectly readable). Could you ask someone on this list to help you hint this font at least minimally (even basic autohinting would be much better)? 2012/11/27 Philippe Verdy verd...@wanadoo.fr Did you try adding the xml:lang=en-Dsrt pseudo-attribute to the html element, as suggested by the W3C Unicorn validator? http://validator.w3.org/unicorn/check?ucn_uri=www.xn--elqus623b.net%2FXKCD%2F1138.htmlucn_lang=frucn_task=conformance# Maybe this could help IE and Firefox, which can't figure out the language used, to properly detect the encoding if they still don't trust the XML declaration in this case, and avoid their use of an encoding guesser. It is anyway curious, because this site is valid as XHTML 1.1 (not as HTML5, which uses a very different and simplified prolog that is not matched here, so the legacy rules should apply to detect XHTML here, then legacy HTML4 if XHTML is no longer recognized by IE and Firefox). Because XHTML is properly tagged, the XML requirements should apply and the XML declaration in the prolog should be used, without needing to guess the encoding from the rest of the content (starting with a meta element in the HTML head element). 2012/11/27 John H. Jenkins jenk...@apple.com That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) 
On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and it displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Ah! Now I see the problem: the XHTML file is being served as HTML instead of XHTML (but this is not invalid for XHTML 1). But anyway, you're also right that the XML prolog found is NOT valid for HTML5 when the file is served as HTML instead of XHTML. This should immediately trigger the fact that HTML5 should not be used to render the page in the HTML profile. So these browsers must find something else: given the XML prolog, they should then use HTML5 in its XHTML profile, not in its HTML profile; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Now given the XML prolog and the DTD declaration, the file is clearly not even HTML5 in XML/XHTML (i.e. XHTML 5), but is XHTML 1 (based on a stable subset of HTML4, but working in strict mode without the quirks modes). Once again, this excludes using the HTML5 rules. I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard, and where processing the XML prolog is NOT an option but a requirement). 2012/11/27 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no Philippe Verdy, Tue, 27 Nov 2012 15:39:43 +0100: I've never said that user agents had to 'write' the prolog. It's the reverse: yes authors have to write a prolog (but the prolog is perfect here so this is not the fault of the author). XML has (or more correctly: can have) a prolog. HTML does not have a prolog. Now to the million dollar question: is your page in question XML or HTML? Answer: Per the Content-Type, it is HTML (that is: text/html). Next question: Does the XML prolog have any effect when the XML file (more specifically: the XHTML file) is served as HTML (that is: text/html)? The answer is that, per HTML5, it does not have effect. And of course, per HTML4, it does not have effect. 
As for XHTML 1, it cannot really regulate what is supposed to happen in text/html. The problem/challenge, however, is that some Web browsers - such as W3m (a text browser), Chrome, Opera and Safari - *do* look at the prolog for encoding info *also* when served as HTML. But Firefox and Internet Explorer do not. Which is according to the HTML5 specification. My guess is that it will *never* become conforming to use the XML prologue in HTML files. However, that does not necessarily prevent Firefox from looking at the prologue for encoding info, when *that* is the only source of encoding info. In fact, I think the HTML5 encoding sniffing algorithm already permits this (since it has a step which roughly says: if the user agent has other sources of information). So, for what it is worth - and with reference to your pages, I filed a bug against Firefox, to make it start to use the encoding declaration of the XML prologue when nothing else is available: https://bugzilla.mozilla.org/show_bug.cgi?id=815279 -- leif halvard silli
Re: xkcd: LTR
No. FreeType is not involved here in the ugly on-screen rendering, under Windows, of the unhinted CMU font provided by the page. Maybe this looks OK on Mac (if Safari is autohinting the font itself, despite the font not being hinted; I'm not sure that Safari on MacOS processes TTF fonts this way when they are not hinted, and I'm convinced that unhinted fonts should not be magically autohinted by the renderer). So using the xml:lang=en-Dsrt pseudo-attribute remains a good suggestion: it allows a CSS stylesheet to avoid referencing the CMU font on Windows and MacOS when displaying the Latin text (using xml:lang=en), and allows the same stylesheet to specify a much better Deseret font for Windows (Segoe UI is fine on Windows). There will still remain a problem for rendering the page on Linux (where FreeType is used, which does not itself autohint the unhinted font, and where Segoe UI is not available) and on Windows before Windows 7 (no Segoe UI font either; you'll also need a hinted version of the CMU font). 2012/11/27 Khaled Hosny khaledho...@eglug.org Looks OK here, but that is probably FreeType doing its magic as usual.
Re: xkcd: LTR
Philippe Verdy, Tue, 27 Nov 2012 21:07:31 +0100: Ah! Now I see the problem: the XHTML file is being served as HTML instead of XHTML (but this is not invalid for XHTML 1). Both SGML-based HTML4 and XML-based XHTML 1 operate with syntax rules that are not - and have never been - compatible with the way text/html operates. Thus, both HTML4 and XHTML1 permit syntaxes whose semantics are ignored when the document is parsed as HTML (as opposed to parsed as SGML or as XML). If you are interested in creating XHTML syntax that is compatible with HTML, then you should look at Polyglot Markup: http://www.w3.org/TR/html-polyglot/ But anyway you're also right that the XML prolog found is NOT valid for HTML5 when the file is served as HTML instead of XHTML. The fact that XHTML 1 permits the XML prolog regardless of how the document is served is just a shortcoming of the XHTML 1 specification. So these browsers must find something else: given the XML prolog they should then use HTML5 in its XHTML profile, not in its HTML profile No, that is not how things work. The decision to parse the document as HTML is taken before the browser sees the XML prologue. So the prologue should not - and does not - change anything with regard to parsing as HTML or as XML. ; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Again: Absolutely not. They must not, will not and must not honour the XML prologue. (It is another matter that some user agents sometimes use the prologue to look for encoding information.) Now given the XML prolog and the DTD declaration, the file is clearly not even HTML5 in XML/XHTML (i.e. XHTML 5), but is XHTML 1 (based on a stable subset of HTML4, but working in strict mode without the quirks modes). Once again, this excludes using the HTML5 rules again. In a way, the names and the numbers (HTML4, XHTML1, HTML5) are just confusing. 
There is just one way to parse HTML. When it comes to HTML (text/html), HTML5 differs from HTML4 and XHTML1 in that it is not based on *another* format than HTML itself. Because HTML4 and XHTML1 are not based on how HTML actually works, and - in addition - do not take full account of that (or whatever the reason), they allow syntaxes, such as DTD declarations, which have no effect (except side-effects such as quirks mode) in HTML. I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard and where processing the XML prolog is NOT an option but a requirement). Just for the record: HTML5 defines the most up-to-date parsing mechanism for *all* HTML documents - HTML 1, 2, 3, 5 as well as any flavour of XHTML served as HTML. HTML5 does not allow authors to use the XML prologue. So while XHTML1 allows you to use the prologue, the best description of how to parse anything that purports to be HTML - HTML5 - does not require user agents/browsers to pay any attention to the prologue. Thus the correct one to blame in this case, for the fact that it doesn't work in Firefox, seems to be the author. (Though we could also blame the history of how HTML developed.) -- leif halvard silli
Re: xkcd: LTR
On 11/27/2012 5:39 AM, Masatoshi Kimura wrote: (2012/11/27 20:27), Philippe Verdy wrote: Could you please stop spreading an unfounded rumor such as Firefox is wrong because it ignores the lacking of HTML5 prolog? Getting Philippe to stop spreading unfounded anything is a near impossible task. :) A./
Re: xkcd: LTR
2012/11/27 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no The fact that XHTML 1 permits the XML prolog regardless of how the document is served is just a shortcoming of the XHTML 1 specification. No, it was by design: making HTML an application of XML. Only XML, but with all the rules of XML. So these browsers must find something else: given the XML prolog they should then use HTML5 in its XHTML profile, not in its HTML profile No, that is not how things work. The decision to parse the document as HTML is taken before the browser sees the XML prologue. So the prologue should not - and does not - change anything with regard to parsing as HTML or as XML. Then explain why the W3C validator sees absolutely no problem in the way these XHTML1 pages are encoded and transported. ; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Again: Absolutely not. They must not, will not and must not honour the XML prologue. (It is another matter that some user agents sometimes use the prologue to look for encoding information.) Sure they can, because this XHTML1 site violates HTML5 rules, missing its required prologue. Now given the XML prolog and the DTD declaration, the file is clearly not even HTML5 in XML/XHTML (i.e. XHTML 5), but is XHTML 1 (based on a stable subset of HTML4, but working in strict mode without the quirks modes). Once again, this excludes using the HTML5 rules again. In a way the names and the numbers (HTML4, XHTML1, HTML5) are just confusing. There is just one way to parse HTML. When it comes to HTML (text/html), HTML5 differs from HTML4 and XHTML1 in that it is not based on *another* format than HTML itself. 
Because HTML4 and XHTML1 are not based on how HTML actually works, and - in addition - do not take full account of that (or whatever the reason), they allow syntaxes, such as DTD declarations, which have no effect (except side-effects such as quirks mode) in HTML. HTML5 admits the two syntaxes: SGML-based, as it is used primarily (in a simplified profile), and XML. I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard and where processing the XML prolog is NOT an option but a requirement). Just for the record: HTML5 defines the most up-to-date parsing mechanism for *all* HTML documents - HTML 1, 2, 3, 5 as well as any flavour of XHTML served as HTML. HTML5 does not allow authors to use the XML prologue. Where? The required HTML5 prolog applies to its SGML-based syntax; it makes no sense in XHTML, as it voluntarily violates the validity of the XML document declaration. The absence of the required HTML5 prolog (in its standard basic-SGML profile), or the presence of another, incompatible XML prolog, is enough to make the distinction between the two syntaxes. But both syntaxes will generate the same HTML DOM, which is just enough to produce the intended rendering and make HTML5 compatible with both syntaxes. Now HTML5 is still not completely polished, finished and approved. Such interoperability rules are not clearly defined, even if they are the most up-to-date way to make it work seamlessly with the claimed compatibility with all flavors of HTML or XHTML. And the fact that Firefox and IE behave differently from Chrome and Safari in this domain is proof of this unfinished status.
Re: xkcd: LTR
Philippe Verdy, Wed, 28 Nov 2012 01:10:45 +0100: 2012/11/27 Leif Halvard Silli The fact that XHTML 1 permits the XML prolog regardless of how the document is served is just a shortcoming of the XHTML 1 specification. No, it was by design. Making HTML an application of XML. Only XML, but with all the rules of XML. It was by design. But nevertheless a shortcoming. They should/could have defined more restrictions on the syntax than they did, and then it would have been OK. But don't forget that XHTML1 also permits you to use the meta element - which works in all web browsers - for setting the encoding: meta http-equiv=Content-Type content=text/html; charset=UTF-8 / This is described in the famous Appendix C of XHTML 1: http://www.w3.org/TR/xhtml1/#C_9 So these browsers must find something else: given the XML prolog they should then use HTML5 in its XHTML profile, not in its HTML profile No, that is not how things work. The decision to parse the document as HTML is taken before the browser sees the XML prologue. So the prologue should not - and does not - change anything with regard to parsing as HTML or as XML. Then explain why the W3C validator sees absolutely no problem in the way these XHTML1 pages are encoded and transported. Because it only checks the syntax, without asking you how you are actually going to use that syntax - whether you want to serve it to an XML parser as XHTML, or you are going to serve it to an HTML parser. For a new version of the validator that asks more of those questions, please try http://validator.w3.org/nu/ - it happens, for the most part, to be developed by one of the Firefox developers, btw. And it allows you to check XHTML1 syntax as well (but only if you serve it as XHTML - if you serve it as HTML, then it validates it as HTML). ; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Again: Absolutely not. 
They must not, will not and must not honour the XML prologue. (It is another matter that some user agents sometimes use the prologue to look for encoding information.) Sure they can, because this XHTML1 site violates HTML5 rules, missing its required prologue. Not sure how you understand the phrase honour the XML prologue. It also sounds as if you say that HTML5 has its own prologue. But HTML5 does not contain any code that is commonly known as a prologue. For instance, if you refer to the code !DOCTYPE html, then this is not a prologue, even if it occurs at the start of the document. Also, since there are two flavours of XML - XML 1.0 and XML 1.1 - the prologue may potentially have an effect on how the document is parsed, but only if the parser already knows that the file is XML. But the XML prologue does not *cause* parsers to choose XML mode rather than HTML mode. (Opera introduced the opposite thing some time ago: if the document is an XHTML document - for real - but contains XML well-formedness errors, then it will switch to HTML mode.) Now given the XML prolog and the DTD declaration, the file is clearly not even HTML5 in XML/XHTML (i.e. XHTML 5), but is XHTML 1 (based on a stable subset of HTML4, but working in strict mode without the quirks modes). Once again, this excludes using the HTML5 rules again. In a way the names and the numbers (HTML4, XHTML1, HTML5) are just confusing. There is just one way to parse HTML. When it comes to HTML (text/html), HTML5 differs from HTML4 and XHTML1 in that it is not based on *another* format than HTML itself. Because HTML4 and XHTML1 are not based on how HTML actually works, and - in addition - do not take full account of that (or whatever the reason), they allow syntaxes, such as DTD declarations, which have no effect (except side-effects such as quirks mode) in HTML. HTML5 admits the two syntaxes: SGML-based, as it is used primarily (in a simplified profile), and XML. From one angle, you are of course right. 
But HTML5 actually explains that what you call SGML-based is not SGML-based but only SGML *inspired*. Thus, HTML5 is much simpler and less cryptic than the (official) SGML syntax of HTML4. I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard and where processing the XML prolog is NOT an option but a requirement). Just for the record: HTML5 defines the most up-to-date parsing mechanism for *all* HTML documents - HTML 1, 2, 3, 5 as well as any flavour of XHTML served as HTML. HTML5 does not allow authors to use the XML prologue. Where? Here: http://dev.w3.org/html5/spec/syntax.html#writing (As you can see, it doesn't say that it is allowed, hence it is not.) You can also see the bottom of this page: http://dev.w3.org/html5/spec/the-meta-element.html#charset The required
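[The "Appendix C" meta element mentioned above works precisely because an ordinary HTML parser sees it, with no XML machinery involved. A small sketch using Python's standard-library HTML parser; the class name is illustrative:]

```python
from html.parser import HTMLParser

# Sketch: an HTML parser (here Python's stdlib one) picks up the
# Appendix C style encoding declaration with no XML machinery at all.
class MetaCharsetFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            content = dict(attrs).get("content", "")
            if "charset=" in content:
                self.charset = content.split("charset=")[1].strip()

page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=UTF-8" /></head><body></body></html>')
finder = MetaCharsetFinder()
finder.feed(page)
print(finder.charset)  # UTF-8
```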
Re: xkcd: LTR
2012/11/28 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no For a new version of the validator that asks more of those questions, please try http://validator.w3.org/nu/ - it happens, for the most part, to be developed by one of the Firefox developers, btw. And it allows you to check XHTML1 syntax as well (but only if you serve it as XHTML - if you serve it as HTML, then it validates it as HTML). This new validator is not the one promoted and supported. I use the Unicorn validator, which checks all W3C-supported markup languages (including HTML5). ; in this profile, they MUST honor the XML prolog and notably its XML encoding declaration (given that the encoding is not specified in the HTTP Content-Type). Again: Absolutely not. They must not, will not and must not honour the XML prologue. (It is another matter that some user agents sometimes use the prologue to look for encoding information.) Sure they can, because this XHTML1 site violates HTML5 rules, missing its required prologue. Not sure how you understand the phrase honour the XML prologue. It also sounds as if you say that HTML5 has its own prologue. But HTML5 does not contain any code that is commonly known as a prologue. For instance, if you refer to the code !DOCTYPE html, then this is not a prologue, even if it occurs at the start of the document. A question of terminology specific to this version; I consider it part of the prolog, and it is not valid XML, so not valid XHTML. From one angle, you are of course right. But HTML5 actually explains that what you call SGML-based is not SGML-based but only SGML *inspired*. Thus, HTML5 is much simpler and less cryptic than the (official) SGML syntax of HTML4. It is evident that here I mean the legacy HTML syntax, not compatible with XML (it allows omitting closing tags, and does not require self-closing tags for empty elements). 
I'm still convinced that these are bugs in Firefox and IE, which support only HTML5 in its basic HTML profile, but not HTML5 in its XML/XHTML profile (which is also part of the HTML5 standard and where processing the XML prolog is NOT an option but a requirement). Just for the record: HTML5 defines the most up-to-date parsing mechanism for *all* HTML documents - HTML 1, 2, 3, 5 as well as any flavour of XHTML served as HTML. HTML5 does not allow authors to use the XML prologue. Where? Here: http://dev.w3.org/html5/spec/syntax.html#writing (As you can see, it doesn't say that it is allowed, hence it is not.) You can also see the bottom of this page: http://dev.w3.org/html5/spec/the-meta-element.html#charset The required HTML5 prolog applies to its SGML-based syntax; Please note that prolog is one thing, and the DOCTYPE is another, see XML 1.0: http://www.w3.org/TR/REC-xml/#sec-prolog-dtd Yes, I know the terminology, but it's evident that I'm including the document declaration as part of the prolog (i.e. everything that is not a comment and that appears before the root element). it makes no sense in XHTML as it voluntarily violates the validity of the XML document declaration. If you are speaking about the HTML5 doctype, then its only effect is to make sure that the HTML parser stays in no-quirks (aka standards) mode. In XHTML, then, you are right that it is not needed. But you are wrong if you say that it is a problem to include it in XHTML, as it causes no harm. In fact, in XHTML, you can drop both the DOCTYPE and the XML prologue. The absence of the required HTML5 prolog (in its standard basic-SGML profile), or the presence of another, incompatible XML prolog, is enough to make the distinction between the two syntaxes. You mean: Visually? Yes. However, that is not how parsers think. What parsers normally do is look at the Content-Type flag before they decide how to parse the document. 
True, but then when the HTML5 parser detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, or for HTML4 or before, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception and try another parser. The XML declaration itself is enough to throw the exception, and is easy enough to detect to allow changing from an HTML parser to an XML parser for XHTML. If even the XML parser fails, then retry with a legacy HTML parser working in quirks mode. Now HTML5 is still not completely polished, finished and approved. Such interoperability rules are not clearly defined, even if they are the most up-to-date way to make it work seamlessly with the claimed compatibility with all flavors of HTML or XHTML. And the fact that Firefox and IE behave differently from Chrome and Safari in this domain is proof of this unfinished status. I would not conclude like that … But it could probably have saved us this discussion if Firefox/IE, like the other dominating browsers, did use it as a
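[The behaviour the thread attributes to Chrome and Opera - peeking at an XML declaration in a text/html byte stream as one *extra* source of encoding information - could be sketched roughly as follows. This is a heuristic illustration, not anything HTML5 requires and not those browsers' actual code:]

```python
import re

def xml_decl_encoding(data):
    """Return the encoding named in a leading XML declaration, if any.
    Purely a heuristic extra hint; HTML5 does not require this."""
    m = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
                 data)
    return m.group(1).decode("ascii") if m else None

print(xml_decl_encoding(b'<?xml version="1.0" encoding="UTF-8"?><html/>'))
# UTF-8
print(xml_decl_encoding(b'<!DOCTYPE html><html></html>'))
# None
```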
Re: xkcd: LTR
detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, or for HTML4 or before, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception and try another parser. There is no spec, that I am aware of, that says that it should do that. But this is in the scope of HTML5, whose claimed purpose is to become compatible with documents encoded in all previous flavors of HTML. Otherwise this claim is very weak, and HTML5 is just a standard compatible with itself and nothing else (it breaks XHTML rules, SGML rules for the document declaration, and IETF charset naming rules with its reinterpretation of ISO 8859-1, which is also still not stabilized). HTML5 is still beta in these claims, and it's regrettable that its required document declaration does not even specify its SGML catalog entry name, even though it forbids the insertion of a DTD. One day or another, the SGML catalog entry name at least will come back, when HTML5 has been released and a newer version is needed and developed; HTML5 should still allow the presence of this SGML catalog entry name, even if it does not require it in this version.
Re: xkcd: LTR
Philippe Verdy, Wed, 28 Nov 2012 04:23:10 +0100: 2012/11/28 Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no For a new version of the validator that asks more of those questions, please try http://validator.w3.org/nu/ - it happens, for the most part, to be developed by one of the Firefox developers, btw. And it allows you to check XHTML1 syntax as well (but only if you serve it as XHTML - if you serve it as HTML, then it validates it as HTML). This new validator is not the one promoted and supported. I use the Unicorn validator, which checks all W3C-supported markup languages (including HTML5). The nu validator is good if you are interested in the questions I mentioned above. Please note that prolog is one thing, and the DOCTYPE is another, see XML 1.0: http://www.w3.org/TR/REC-xml/#sec-prolog-dtd Yes, I know the terminology, but it's evident that I'm including the document declaration as part of the prolog (i.e. everything that is not a comment and that appears before the root element). It is just as confusing as ever that you continue to insist on your terminology. The absence of the required HTML5 prolog (in its standard basic-SGML profile), or the presence of another, incompatible XML prolog, is enough to make the distinction between the two syntaxes. You mean: Visually? Yes. However, that is not how parsers think. What parsers normally do is look at the Content-Type flag before they decide how to parse the document. True, but then when the HTML5 parser The HTML5 parser is just the one and only (updated) HTML parser. detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, or for HTML4 or before, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception and try another parser. There is no spec, that I am aware of, that says that it should do that. -- leif halvard silli
Re: xkcd: LTR
Philippe Verdy, Wed, 28 Nov 2012 04:50:06 +0100: detects a violation of the required extended prolog (sorry, the HTML5 document declaration, which is not a valid document declaration for XHTML, for HTML4 or earlier, or even for SGML, due to the unspecified schema after the schema short name), it should catch this exception and try another parser. There is no spec, that I am aware of, that says that it should do that. But this is within the scope of HTML5, whose claimed purpose is to become compatible with documents encoded in all previous flavors of HTML. I admit that understanding the meaning behind all the slogans about HTML5 can be demanding. But the goal has all along been to create a *single* HTML parser, and not to introduce switching between multiple HTML parsers. If you think otherwise, then my claim is that you have misunderstood. Otherwise this claim is very weak and HTML5 is just a standard compatible with itself, Yes, HTML5 is a standard in itself. For instance, the issue of the XML prologue has been mentioned from time to time during the HTML5 process, but a deliberate choice was made not to accept it as part of the syntax. Probably one of the motivations for that choice was to help authors keep HTML and XHTML separate. Also, HTML5 contains some willful violations of other standards. But then, a standard is supposed to set a new standard, hence that should in principle be OK. It is true that terms such as Web compatible, and compatible in general, have been used sloganishly about HTML5. I think, in one way, it was just a method for getting things to move. But it is not the case that compatibility has trumped every other HTML5 design option - other things to consider are, for instance, that the end result - the final syntax - be simple to understand, without overly complicated and convoluted rules. Just my two cents, about how I see it. -- leif halvard silli
Re: xkcd: LTR
Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
RE: xkcd: LTR
Somewhat ironically, both Firefox and Internet Explorer, on my machine at least, detect this page as encoded in ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog, which is the only place where the encoding is stated. From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of John H. Jenkins Sent: Tuesday, 27 November 2012 1:15 AM To: Unicode Mailing List Subject: Re: xkcd: LTR Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Not a bug of your machine or browser; this is a problem of the web server's metadata. The transport layer indicates another encoding to the client in the HTTP headers, and that prevails over what the document itself declares. In this case, the web server should either transcode the source document to match what it announces in its HTTP headers, or better identify its local file contents so as to send the correct HTTP header. Send a bug report to the site admin to fix the web server settings, possibly per directory, or using a naming scheme for pages that are encoded differently: e.g. http://www.example.net/path/to/file.UTF-8.html would request the content of a file named file.UTF-8.html, whose explicit *.UTF-8.html extension the server can map to an HTTP header declaring the effective UTF-8 encoding (instead of cp-1252). My opinion, however, is that new content should always be encoded in UTF-8, and older content can be linked from a separate archive directory where it can be mapped to the older encoding without having to re-encode it. 2012/11/26 Marc Durdin marc.dur...@tavultesoft.com Somewhat ironically, both Firefox and Internet Explorer, on my machine at least, detect this page as encoded in ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog, which is the only place where the encoding is stated. From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of John H. Jenkins Sent: Tuesday, 27 November 2012 1:15 AM To: Unicode Mailing List Subject: Re: xkcd: LTR Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
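The per-extension naming scheme suggested above can be expressed directly in server configuration. A sketch, assuming an Apache httpd server with mod_mime (the directory default shown here is an assumption, not the actual site's configuration):

```apache
# Hypothetical per-directory (.htaccess) sketch: files named *.UTF-8.html
# get a UTF-8 charset in their Content-Type header, while everything else
# in the directory keeps the legacy default.
AddDefaultCharset windows-1252
AddCharset UTF-8 .UTF-8
```

With this mapping, old pages need no re-encoding; only renamed (or newly authored) UTF-8 files opt in to the correct header.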
Re: xkcd: LTR
That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
RE: xkcd: LTR
In this instance the web server is not returning an encoding (“Content-Type: text/html”), which is why I was curious to see that neither web browser picked up the UTF-8 hint in the XML prolog. Chrome does detect UTF-8 for that page. From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy Sent: Tuesday, 27 November 2012 7:49 AM To: Marc Durdin Cc: John H. Jenkins; Unicode Mailing List Subject: Re: xkcd: LTR Not a bug of your machine or browser; this is a problem of the web server's metadata. The transport layer indicates another encoding to the client in the HTTP headers, and that prevails over what the document itself declares. In this case, the web server should either transcode the source document to match what it announces in its HTTP headers, or better identify its local file contents so as to send the correct HTTP header. Send a bug report to the site admin to fix the web server settings, possibly per directory, or using a naming scheme for pages that are encoded differently: e.g. http://www.example.net/path/to/file.UTF-8.html would request the content of a file named file.UTF-8.html, whose explicit *.UTF-8.html extension the server can map to an HTTP header declaring the effective UTF-8 encoding (instead of cp-1252). My opinion, however, is that new content should always be encoded in UTF-8, and older content can be linked from a separate archive directory where it can be mapped to the older encoding without having to re-encode it. 2012/11/26 Marc Durdin marc.dur...@tavultesoft.com Somewhat ironically, both Firefox and Internet Explorer 9, on my machine at least, detect this page as encoded in ISO-8859-1 and cp-1252 respectively, instead of UTF-8. It seems they both ignore the XML prolog, which is the only place where the encoding is stated. 
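The precedence chain under discussion - the HTTP Content-Type charset wins when present; otherwise the XML prolog's encoding pseudo-attribute applies; otherwise XML defaults to UTF-8 - can be sketched as follows. The function name and regexes are illustrative assumptions, not any browser's real algorithm:

```python
# Sketch of charset precedence for an XML/XHTML document, assuming the
# transport header outranks any in-document declaration.
import re

def effective_charset(content_type: str, body: bytes) -> str:
    # 1. Transport layer wins: a charset parameter in the HTTP header.
    m = re.search(r'charset\s*=\s*"?([\w.-]+)', content_type, re.I)
    if m:
        return m.group(1).lower()
    # 2. Otherwise, an encoding pseudo-attribute in the XML prolog.
    m = re.match(rb'\s*<\?xml[^>]*encoding\s*=\s*["\']([\w.-]+)', body)
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Otherwise, XML documents default to UTF-8.
    return "utf-8"
```

On Marc's example, step 1 never fires (the server sends a bare “text/html”), so an XML-aware client should fall through to the prolog - which is exactly what Chrome appears to do and Firefox/IE do not.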
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of John H. Jenkins Sent: Tuesday, 27 November 2012 1:15 AM To: Unicode Mailing List Subject: Re: xkcd: LTR Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Also, I really don't like the Deseret font: {font-family: CMU; src: url(CMUSerif-Roman.ttf) format(truetype);} that you have inserted in your stylesheet (da.css), which is used to display the whole text content of the page, including the English Latin text at the bottom. This downloaded font is difficult to read because it is not hinted at all, so its on-screen rendering is extremely poor (we probably don't want to print each page of this XKCD series, when the main interest is the image, which is perfectly readable). Could you ask someone on this list to help you hint this font at least minimally (even basic autohinting would be much better)? 2012/11/27 Philippe Verdy verd...@wanadoo.fr Did you try adding the xml:lang=en-Dsrt pseudo-attribute to the html element, as suggested by the W3C Unicorn validator? http://validator.w3.org/unicorn/check?ucn_uri=www.xn--elqus623b.net%2FXKCD%2F1138.htmlucn_lang=frucn_task=conformance# Maybe this could help IE and Firefox, which can't figure out the language used, to properly detect the encoding if they still don't trust the XML declaration in this case, and avoid their using an encoding guesser. It is anyway curious, because this site is valid as XHTML 1.1 (not as HTML5, which uses a very different and simplified prolog that is not matched here, so the legacy rules should apply to detect XHTML here, then legacy HTML4 if XHTML is no longer recognized by IE and Firefox). Because the XHTML is properly tagged, the XML requirements should apply and the XML declaration in the prolog should be used, without needing to guess the encoding from the rest of the content (starting with a meta element in the HTML head element). 2012/11/27 John H. Jenkins jenk...@apple.com That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. 
The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Did you try adding the xml:lang=en-Dsrt pseudo-attribute to the html element, as suggested by the W3C Unicorn validator? http://validator.w3.org/unicorn/check?ucn_uri=www.xn--elqus623b.net%2FXKCD%2F1138.htmlucn_lang=frucn_task=conformance# Maybe this could help IE and Firefox, which can't figure out the language used, to properly detect the encoding if they still don't trust the XML declaration in this case, and avoid their using an encoding guesser. It is anyway curious, because this site is valid as XHTML 1.1 (not as HTML5, which uses a very different and simplified prolog that is not matched here, so the legacy rules should apply to detect XHTML here, then legacy HTML4 if XHTML is no longer recognized by IE and Firefox). Because the XHTML is properly tagged, the XML requirements should apply and the XML declaration in the prolog should be used, without needing to guess the encoding from the rest of the content (starting with a meta element in the HTML head element). 2012/11/27 John H. Jenkins jenk...@apple.com That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
Re: xkcd: LTR
Anyway, you could at least list Segoe UI before your CMU font; even if Segoe UI works only on Windows, it has decent support for Deseret. Maybe there's also a good font that ships with some recent version of Mac OS, which you could list too, leaving your CMU after them only for other OSes. In all cases, I also suggest tagging only the parts that are written in Deseret with xml:lang=en-Dsrt, so that you can have a CSS selector to match the Deseret fonts. For the rest, just use your choice of Lucida, Arial, sans-serif in less selective CSS selectors (that don't care about the language tags). The template design of these pages is simple enough that you can do it with just a few modifications. 2012/11/27 Philippe Verdy verd...@wanadoo.fr Also, I really don't like the Deseret font: {font-family: CMU; src: url(CMUSerif-Roman.ttf) format(truetype);} that you have inserted in your stylesheet (da.css), which is used to display the whole text content of the page, including the English Latin text at the bottom. This downloaded font is difficult to read because it is not hinted at all, so its on-screen rendering is extremely poor (we probably don't want to print each page of this XKCD series, when the main interest is the image, which is perfectly readable). Could you ask someone on this list to help you hint this font at least minimally (even basic autohinting would be much better)? 2012/11/27 Philippe Verdy verd...@wanadoo.fr Did you try adding the xml:lang=en-Dsrt pseudo-attribute to the html element, as suggested by the W3C Unicorn validator? http://validator.w3.org/unicorn/check?ucn_uri=www.xn--elqus623b.net%2FXKCD%2F1138.htmlucn_lang=frucn_task=conformance# Maybe this could help IE and Firefox, which can't figure out the language used, to properly detect the encoding if they still don't trust the XML declaration in this case, and avoid their using an encoding guesser. 
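The suggestion above - scope the downloaded font to the Deseret-tagged parts via a language selector, and leave Latin text on ordinary system fonts - could look like this in the stylesheet. This is a sketch, not the actual contents of da.css; the fallback lists are assumptions:

```css
/* Keep the downloadable font available, but stop applying it globally. */
@font-face {
  font-family: "CMU";
  src: url("CMUSerif-Roman.ttf") format("truetype");
}

/* Only elements tagged with a Deseret language tag use Deseret-capable
   fonts; Segoe UI first (Windows), downloaded CMU as the fallback. */
:lang(en-Dsrt) {
  font-family: "Segoe UI", "CMU", serif;
}

/* Everything else (the English Latin text) uses normal system fonts. */
body {
  font-family: "Lucida Grande", Arial, sans-serif;
}
```

With the page served as XHTML, the xml:lang attribute drives the :lang() selector, so only the comic transcription picks up the unhinted font.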
It is anyway curious, because this site is valid as XHTML 1.1 (not as HTML5, which uses a very different and simplified prolog that is not matched here, so the legacy rules should apply to detect XHTML here, then legacy HTML4 if XHTML is no longer recognized by IE and Firefox). Because the XHTML is properly tagged, the XML requirements should apply and the XML declaration in the prolog should be used, without needing to guess the encoding from the rest of the content (starting with a meta element in the HTML head element). 2012/11/27 John H. Jenkins jenk...@apple.com That's because the domain does, in fact, use sinograms and not Deseret. (It's my Chinese name.) On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote: I wonder why this IDN link appears to me using sinograms in its domain name, instead of Deseret letters. The link works, but my browser cannot display it and displays the Punycoded name instead, without decoding it. This is strange because I do have Deseret fonts installed and I can view Unicode HTML pages containing Deseret letters. 2012/11/26 John H. Jenkins jenk...@apple.com Or, if one prefers: http://www.井作恆.net/XKCD/1137.html On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote: http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie
xkcd: LTR
http://xkcd.com/1137/ Finally, an xkcd for Unicoders. :-) Debbie