Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-31 Thread Ian Hickson
On Thu, 4 Jul 2013, Michael Day wrote:
> > 
> > The problem is that we can't do (2) in _all_ cases, e.g. innerHTML on 
> > an <svg> can't possibly break out of the <svg> if it sees one of these 
> > tags, since that's the "root" of what is being parsed.
> 
> Yes, HTML has already lost the composability of parsing that XML and 
> other languages have, that's long gone. But that doesn't mean we should 
> try to make it even more irregular :)
> 
> Currently Firefox, Chrome, and Prince all treat the fragment case the 
> same as the whole document case, so we already have interoperable 
> behaviour on this issue.

If you treated them the same, you would either crash or have an infinite 
loop, because you'd either pop the root element off the stack and then try 
to append something to null, or you'd try to reprocess the token without 
having popped anything first.

There has to be _some_ special casing of .innerHTML.
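
To make the first failure mode above concrete, here is a toy Python model. It is a sketch, not the spec's algorithm: the stack is a list of (namespace, name) pairs, `break_out` is a hypothetical helper that pops foreign elements until the current node is in the HTML namespace, and the fragment stack is modelled as if the foreign context element itself were the root.

```python
HTML, SVG = "html", "svg"  # namespace labels used only in this sketch


def break_out(stack):
    """Pop foreign elements until the current node is an HTML element.

    Returns the remaining stack; an empty list means even the root was popped.
    """
    remaining = list(stack)
    while remaining and remaining[-1][0] != HTML:
        remaining.pop()
    return remaining


# Whole-document case: a <p> start tag arrives while <g> is the current node.
document_stack = [(HTML, "html"), (HTML, "body"), (SVG, "svg"), (SVG, "g")]
print(break_out(document_stack))  # <body> is left as the current node

# Fragment case, modelled with the foreign context element as the root.
fragment_stack = [(SVG, "svg"), (SVG, "g")]
print(break_out(fragment_stack))  # nothing is left to append the <p> to
```

The other option, reprocessing the token without popping anything, leaves the parser in the same state with the same token, which is the infinite loop.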

What should the special casing be? Consider this case:

   .innerHTML = ''

I can see two possible options:

svg
|
+-- g
    |
    +-- P

Or:

svg
|
+-- g
|
+-- P

Neither is what happens in the non-fragment case (in that case the <p> is 
a sibling of the <svg>).

Consider this case:

   .innerHTML = ''

Here, the  node could be a child of the innermost , the innermost 
, the outermost , or the outermost . I could see arguments 
for all those cases. It seems unlikely that the author meant any of them.
 

> Since the HTML spec is supposed to reflect reality, it seems pointless 
> to deliberately introduce an inconsistency in the parsing model that 
> requires changes in all user agents to implement.

All the user agents (or at least, all the browsers I could test) have to 
change anyway. Blink-based browsers and WebKit-based browsers don't 
support innerHTML on <svg> at all. Firefox supports innerHTML on <svg> but 
puts all the nodes in the HTML namespace.

In conclusion, the reason I simply removed the quirk from fragment parsing 
rather than trying to make it work is that:

 - all browsers will have to change anyway,

 - the quirk needs special handling in the fragment case anyway,

 - it's not clear what the behaviour should be,

 - in many cases, we're not error-correcting in a useful way anyway.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-03 Thread Michael Day

Hi Ian,


> The problem is that we can't do (2) in _all_ cases, e.g. innerHTML on an
> <svg> can't possibly break out of the <svg> if it sees one of these tags,
> since that's the "root" of what is being parsed.


Yes, HTML has already lost the composability of parsing that XML and 
other languages have, that's long gone. But that doesn't mean we should 
try to make it even more irregular :)


Currently Firefox, Chrome, and Prince all treat the fragment case the 
same as the whole document case, so we already have interoperable 
behaviour on this issue.


Since the HTML spec is supposed to reflect reality, it seems pointless 
to deliberately introduce an inconsistency in the parsing model that 
requires changes in all user agents to implement.


Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-03 Thread Ian Hickson
On Thu, 4 Jul 2013, Michael Day wrote:
> > 
> > We don't have any data that says that we need to support this for 
> > innerHTML. I think it's a win if we can drop the hack from innerHTML.
> 
> Okay, so allowing some HTML elements to break out of foreign content is 
> a hack added for historical reasons, that will surprise authors and 
> complicate implementations and is thus regrettable, but necessary.
> 
> Then there are two possibilities for fragment parsing:
> 
> (1) The hack can be left out of fragment parsing, as there is no 
> historical justification for it. Since the hack is bad, removing it from 
> as many situations as possible is good.
> 
> (2) The hack can apply to fragment parsing in the same way as it applies 
> to regular parsing. This makes parsing behaviour more consistent across 
> different situations, which is good.
> 
> I'm strongly in favour of (2), as it seems that omitting the hack from 
> some rare situations doesn't save authors any trouble, and doesn't 
> follow the principle of least surprise.

The problem is that we can't do (2) in _all_ cases, e.g. innerHTML on an 
<svg> can't possibly break out of the <svg> if it sees one of these tags, 
since that's the "root" of what is being parsed.

Given that, it's not clear that (2) is better than (1). (I agree that if 
we could actually always be consistent, it would be.)

Note that this isn't the only place like that.

   

   

...and:

   document.createElement('table').innerHTML = '';

...result in very different DOMs (in the first, the  and the 
 are siblings; in the latter, the  is a child).


> In an ideal world it would be possible to grab any subsection of a 
> document, parse that in isolation as a fragment, and get the same result 
> as if it was parsed in its original document context. This is possible 
> in XML, but not HTML, due to the existing "author-friendly" hacks, and 
> making the parsing behaviour even more context sensitive doesn't seem 
> like a good thing.

I think we're _so_ far beyond this ideal world that I'm not sure it's 
worth even looking for it, to be honest. :-)



Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-03 Thread Michael Day

Hi Ian,


> We don't have any data that says that we need to support this for
> innerHTML. I think it's a win if we can drop the hack from innerHTML.


Okay, so allowing some HTML elements to break out of foreign content is 
a hack added for historical reasons, that will surprise authors and 
complicate implementations and is thus regrettable, but necessary.


Then there are two possibilities for fragment parsing:

(1) The hack can be left out of fragment parsing, as there is no 
historical justification for it. Since the hack is bad, removing it from 
as many situations as possible is good.


(2) The hack can apply to fragment parsing in the same way as it applies 
to regular parsing. This makes parsing behaviour more consistent across 
different situations, which is good.


I'm strongly in favour of (2), as it seems that omitting the hack from 
some rare situations doesn't save authors any trouble, and doesn't 
follow the principle of least surprise.


In an ideal world it would be possible to grab any subsection of a 
document, parse that in isolation as a fragment, and get the same result 
as if it was parsed in its original document context. This is possible 
in XML, but not HTML, due to the existing "author-friendly" hacks, and 
making the parsing behaviour even more context sensitive doesn't seem 
like a good thing.


Best regards,

Michael



Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-02 Thread Ian Hickson
On Tue, 2 Jul 2013, Michael Day wrote:
> 
> The new text reads:
> 
> "If the parser was originally created for the HTML fragment parsing algorithm,
> then act as described in the "any other start tag" entry below. (fragment
> case)"
> 
> This always just adds the HTML element in place inside the foreign content,
> even if the fragment context element *is* a HTML element!

Right, that's the intent.

This specific clause is a hack to make certain elements break out of 
foreign content, because we found some pages that do crazy stuff like:

   Bla bla
   
   Bla bla

...which, prior to SVG being added to HTML, would show two paragraphs, but 
if we didn't have this hack, it would now just end the page at the <svg> tag.


> This can't be right, as it means parsing document.body.innerHTML will 
> behave totally differently to parsing , for no reason.

Not totally differently, only differently in the specific cases of these 
few tags that trigger this wacked behaviour in markup that's broken anyway.

We don't have any data that says that we need to support this for 
innerHTML. I think it's a win if we can drop the hack from innerHTML.


> Looking back a couple of years, this section of the spec seems to be 
> drifting in a random walk away from reality. We can study this further 
> and try suggesting some text based on what we have implemented so far.

Well, when it started it wasn't reality at all, since there was no foreign 
content support in text/html. :-)



Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-02 Thread Michael Day

Hi Ian,


> I ended up removing this from the spec for other reasons, so this should
> be resolved now. Let me know if it's not.
>
> (No, I don't know what I had originally intended.)


I don't think the new spec is correct. The question is what happens if 
we are tokenizing some foreign content, and we see an HTML start tag.


In the normal case, we pop off all the foreign elements until we get 
back to the HTML namespace, then reprocess the token.


In the fragment case, the context element may be a foreign element, so 
there was the wrinkle of having to handle that appropriately when we 
have this fake "root" <html> element that makes everything confusing.


The new text reads:

"If the parser was originally created for the HTML fragment parsing 
algorithm, then act as described in the "any other start tag" entry 
below. (fragment case)"


This always just adds the HTML element in place inside the foreign 
content, even if the fragment context element *is* a HTML element!
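
As a rough illustration of the difference being described, the following Python sketch compares where the new HTML element would end up under the two rules. It is a toy model under stated assumptions: the stack is a list of (namespace, name) pairs whose bottom entry is an HTML element, and `parent_for_html_start_tag` is a hypothetical helper, not anything named in the spec.

```python
HTML, SVG = "html", "svg"  # namespace labels used only in this sketch


def parent_for_html_start_tag(stack, fragment_case):
    """Return the stack entry that would become the new element's parent.

    Assumes the bottom of the stack is an element in the HTML namespace.
    """
    if fragment_case:
        # New text: act as "any other start tag", i.e. insert in place.
        return stack[-1]
    # Normal case: pop foreign elements, then reprocess the token.
    remaining = list(stack)
    while remaining[-1][0] != HTML:
        remaining.pop()
    return remaining[-1]


# Whole document: <svg><g> inside <body>, then an HTML start tag arrives.
document_stack = [(HTML, "html"), (HTML, "body"), (SVG, "svg"), (SVG, "g")]
# Fragment with an HTML context element: only the root is below the <svg>.
fragment_stack = [(HTML, "html"), (SVG, "svg"), (SVG, "g")]

print(parent_for_html_start_tag(document_stack, fragment_case=False))
print(parent_for_html_start_tag(fragment_stack, fragment_case=True))
```

In this model the document case picks the body entry, while the fragment case picks the g entry, even though no foreign context element is involved.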


This can't be right, as it means parsing document.body.innerHTML will 
behave totally differently to parsing , for no reason.


Looking back a couple of years, this section of the spec seems to be 
drifting in a random walk away from reality. We can study this further 
and try suggesting some text based on what we have implemented so far.


Best regards,

Michael



Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-01 Thread Ian Hickson
On Thu, 18 Apr 2013, Michael Day wrote:
> 
> Another issue regarding recent changes to 12.2.5.5 "The rules for 
> parsing tokens in foreign content".
> 
> When a HTML start tag is seen (specifically "b", "big", "blockquote", 
> "body", "br", "center", "code", ...) the following procedure is given to 
> recover from the parse error:
> 
> """
> If the stack of open elements does not have an element in scope that is a
> MathML text integration point, an HTML integration point, or an element in the
> HTML namespace, or if the stack of open elements has only one element, then
> process the token using the rules for the "in body" insertion mode. (fragment
> case)
> """
> 
> Since the stack of open elements always has <html> at the top of the 
> stack, the "element in scope" algorithm will always find it, and as a 
> result, the first part of the condition will always fail.

I ended up removing this from the spec for other reasons, so this should 
be resolved now. Let me know if it's not.

(No, I don't know what I had originally intended.)



Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-06-23 Thread Michael Day

Hi Adam,


> > Since the stack of open elements always has <html> at the top of the stack,
> > the "element in scope" algorithm will always find it, and as a result, the
> > first part of the condition will always fail.
>
> Even in the fragment case?  (Note the parenthetical remark in the spec
> about this text applying only in the fragment case.)

Yes, see 12.4, the stack of open elements always contains a <html> root 
in the fragment case when there is a context element:


Let root be a new html element with no attributes.
...
Set up the parser's stack of open elements so that it contains just
the single element root.
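
A small Python sketch of one reading of this, for illustration only. The scope-marker and integration-point sets are deliberately abbreviated (the spec's lists are longer), the stack is a list of (namespace, name) pairs, and `is_target` and `has_target_in_scope` are hypothetical names for the condition quoted from 12.2.5.5 and a scope scan in the style of "has an element in scope".

```python
HTML, SVG, MATHML = "html", "svg", "mathml"  # labels used only in this sketch

# Abbreviated sets; the real lists in the spec contain more entries.
SCOPE_MARKERS = {(HTML, "html"), (HTML, "table"), (SVG, "foreignObject")}
HTML_INTEGRATION_POINTS = {(SVG, "foreignObject"), (SVG, "desc"), (SVG, "title")}
MATHML_TEXT_INTEGRATION_POINTS = {(MATHML, "mi"), (MATHML, "mo"), (MATHML, "mtext")}


def is_target(node):
    """True for the kinds of element named in the quoted condition."""
    return (
        node[0] == HTML
        or node in HTML_INTEGRATION_POINTS
        or node in MATHML_TEXT_INTEGRATION_POINTS
    )


def has_target_in_scope(stack):
    """Scan from the current node towards the root, stopping at scope markers."""
    for node in reversed(stack):
        if is_target(node):
            return True
        if node in SCOPE_MARKERS:
            return False
    return False


root = (HTML, "html")  # the root element set up for fragment parsing
for stack in ([root], [root, (SVG, "svg"), (SVG, "g")]):
    # The scan reaches the HTML-namespace root, so "does not have ... in scope"
    # is false for both of these stacks.
    print(stack, not has_target_in_scope(stack))
```

For the stacks tried here the negated test is never true, which leaves only the "stack of open elements has only one element" clause able to trigger.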

Best regards,

Michael



Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-06-22 Thread Adam Barth
On Thu, Apr 18, 2013 at 12:27 AM, Michael Day wrote:
> Another issue regarding recent changes to 12.2.5.5 "The rules for parsing
> tokens in foreign content".
>
> When a HTML start tag is seen (specifically "b", "big", "blockquote",
> "body", "br", "center", "code", ...) the following procedure is given to
> recover from the parse error:
>
> """
> If the stack of open elements does not have an element in scope that is a
> MathML text integration point, an HTML integration point, or an element in
> the HTML namespace, or if the stack of open elements has only one element,
> then process the token using the rules for the "in body" insertion mode.
> (fragment case)
> """
>
> Since the stack of open elements always has <html> at the top of the stack,
> the "element in scope" algorithm will always find it, and as a result, the
> first part of the condition will always fail.

Even in the fragment case?  (Note the parenthetical remark in the spec
about this text applying only in the fragment case.)

Adam


> This seems unintentional, and depends upon the exact way in which the
> "element in scope" algorithm is defined.
>
> Perhaps rewriting this paragraph without reference to the "element in scope"
> algorithm would make the intent clearer? For example:
>
> If the stack of open elements does not contain any elements that are MathML text
> integration points, or HTML integration points, or that are in the HTML
> namespace, or if the stack of open elements has only one element ...
>
> Any thoughts?
>
> Best regards,
>
> Michael