Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
On Thu, 4 Jul 2013, Michael Day wrote:
> > The problem is that we can't do (2) in _all_ cases, e.g. innerHTML on an can't possibly break out of the if it sees one of these tags, since that's the "root" of what is being parsed.
>
> Yes, HTML has already lost the composability of parsing that XML and other languages have, that's long gone. But that doesn't mean we should try to make it even more irregular :)
>
> Currently Firefox, Chrome, and Prince all treat the fragment case the same as the whole document case, so we already have interoperable behaviour on this issue.

If you treated them the same, you would either crash or have an infinite loop, because you'd either pop the root element off the stack and then try to append something to null, or you'd try to reprocess the token without having popped anything first. There has to be _some_ special casing of .innerHTML.

What should the special casing be?

Consider this case:

   .innerHTML = ''

I can see two possible options:

   svg | +-- g | +-- P

Or:

   svg | +-- g | +-- P

Neither are what happens in the non-fragment case (in that case the is a sibling of the ).

Consider this case:

   .innerHTML = ''

Here, the node could be a child of the innermost , the innermost , the outermost , or the outermost . I could see arguments for all those cases. It seems unlikely that the author meant any of them.

> Since the HTML spec is supposed to reflect reality, it seems pointless to deliberately introduce an inconsistency in the parsing model that requires changes in all user agents to implement.

All the user agents (or at least, all the browsers I could test) have to change anyway. Blink-based browsers and WebKit-based browsers don't support innerHTML on at all. Firefox supports innerHTML on but puts all the nodes in the HTML namespace.
In conclusion, the reason I simply removed the quirk from fragment parsing rather than trying to make it work is that:

 - all browsers will have to change anyway,
 - the quirk needs special handling in the fragment case anyway,
 - it's not clear what the behaviour should be,
 - in many cases, we're not error-correcting in a useful way anyway.

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
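The "infinite loop" half of the argument in this message can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the spec's algorithm: elements are modelled as hypothetical `(name, namespace)` tuples, `adjusted_current_node` and `reprocess_breakout_tag` are names invented here, and `max_steps` is an artificial guard that stands in for "loops forever".

```python
HTML_NS, SVG_NS = "html", "svg"


def adjusted_current_node(stack, context):
    # Fragment case: once only the root is left on the stack, the context
    # element stands in for the current node. context is None when parsing
    # a whole document.
    if context is not None and len(stack) == 1:
        return context
    return stack[-1]


def reprocess_breakout_tag(stack, context, max_steps=5):
    """Count dispatch rounds for one breakout start tag.

    Returns the round in which the HTML rules would take the token, or
    None if that never happens within max_steps rounds.
    """
    for step in range(1, max_steps + 1):
        if adjusted_current_node(stack, context)[1] == HTML_NS:
            return step
        # Foreign content rules: pop until the current node is in the HTML
        # namespace, then go round again to reprocess the token.
        while stack[-1][1] != HTML_NS:
            stack.pop()
    return None


# Whole document: html > body > svg > g, then a breakout tag arrives.
document_stack = [("html", HTML_NS), ("body", HTML_NS), ("svg", SVG_NS), ("g", SVG_NS)]
print(reprocess_breakout_tag(document_stack, None))

# Fragment parsed with an svg context element: root > g.
fragment_stack = [("html", HTML_NS), ("g", SVG_NS)]
print(reprocess_breakout_tag(fragment_stack, ("svg", SVG_NS)))
```

In the fragment run, the second round pops nothing because the root is already an HTML element, yet the adjusted current node is still the foreign context element, so the token keeps being handed back to the foreign content rules. The alternative of also popping the root would leave an empty stack, which corresponds to the "append something to null" crash described above.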
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
Hi Ian,

> The problem is that we can't do (2) in _all_ cases, e.g. innerHTML on an can't possibly break out of the if it sees one of these tags, since that's the "root" of what is being parsed.

Yes, HTML has already lost the composability of parsing that XML and other languages have, that's long gone. But that doesn't mean we should try to make it even more irregular :)

Currently Firefox, Chrome, and Prince all treat the fragment case the same as the whole document case, so we already have interoperable behaviour on this issue.

Since the HTML spec is supposed to reflect reality, it seems pointless to deliberately introduce an inconsistency in the parsing model that requires changes in all user agents to implement.

Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
On Thu, 4 Jul 2013, Michael Day wrote:
> > We don't have any data that says that we need to support this for innerHTML. I think it's a win if we can drop the hack from innerHTML.
>
> Okay, so allowing some HTML elements to break out of foreign content is a hack added for historical reasons, that will surprise authors and complicate implementations and is thus regrettable, but necessary.
>
> Then there are two possibilities for fragment parsing:
>
> (1) The hack can be left out of fragment parsing, as there is no historical justification for it. Since the hack is bad, removing it from as many situations as possible is good.
>
> (2) The hack can apply to fragment parsing in the same way as it applies to regular parsing. This makes parsing behaviour more consistent across different situations, which is good.
>
> I'm strongly in favour of (2), as it seems that omitting the hack from some rare situations doesn't save authors any trouble, and doesn't follow the principle of least surprise.

The problem is that we can't do (2) in _all_ cases, e.g. innerHTML on an can't possibly break out of the if it sees one of these tags, since that's the "root" of what is being parsed.

Given that, it's not clear that (2) is better than (1). (I agree that if we could actually always be consistent, it would be.)

Note that this isn't the only place like that.

...and:

   document.createElement('table').innerHTML = '';

...result in very different DOMs (in the first, the and the are siblings; in the latter, the is a child).

> In an ideal world it would be possible to grab any subsection of a document, parse that in isolation as a fragment, and get the same result as if it was parsed in its original document context. This is possible in XML, but not HTML, due to the existing "author-friendly" hacks, and making the parsing behaviour even more context sensitive doesn't seem like a good thing.
I think we're _so_ far beyond this ideal world that I'm not sure it's worth even looking for it, to be honest. :-)

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
Hi Ian,

> We don't have any data that says that we need to support this for innerHTML. I think it's a win if we can drop the hack from innerHTML.

Okay, so allowing some HTML elements to break out of foreign content is a hack added for historical reasons, that will surprise authors and complicate implementations and is thus regrettable, but necessary.

Then there are two possibilities for fragment parsing:

(1) The hack can be left out of fragment parsing, as there is no historical justification for it. Since the hack is bad, removing it from as many situations as possible is good.

(2) The hack can apply to fragment parsing in the same way as it applies to regular parsing. This makes parsing behaviour more consistent across different situations, which is good.

I'm strongly in favour of (2), as it seems that omitting the hack from some rare situations doesn't save authors any trouble, and doesn't follow the principle of least surprise.

In an ideal world it would be possible to grab any subsection of a document, parse that in isolation as a fragment, and get the same result as if it was parsed in its original document context. This is possible in XML, but not HTML, due to the existing "author-friendly" hacks, and making the parsing behaviour even more context sensitive doesn't seem like a good thing.

Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
On Tue, 2 Jul 2013, Michael Day wrote:
> The new text reads:
>
> "If the parser was originally created for the HTML fragment parsing algorithm, then act as described in the "any other start tag" entry below. (fragment case)"
>
> This always just adds the HTML element in place inside the foreign content, even if the fragment context element *is* a HTML element!

Right, that's the intent. This specific clause is a hack to make certain elements break out of foreign content, because we found some pages that do crazy stuff like:

   Bla bla Bla bla

...which, prior to SVG being added to HTML, would show two paragraphs, but if we didn't have this hack, it would now just end the page at the tag.

> This can't be right, as it means parsing document.body.innerHTML will behave totally differently to parsing , for no reason.

Not totally differently, only differently in the specific cases of these few tags that trigger this wacked behaviour in markup that's broken anyway.

We don't have any data that says that we need to support this for innerHTML. I think it's a win if we can drop the hack from innerHTML.

> Looking back a couple of years, this section of the spec seems to be drifting in a random walk away from reality. We can study this further and try suggesting some text based on what we have implemented so far.

Well, when it started it wasn't reality at all, since there was no foreign content support in text/html. :-)

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
Hi Ian,

> I ended up removing this from the spec for other reasons, so this should be resolved now. Let me know if it's not. (No, I don't know what I had originally intended.)

I don't think the new spec is correct.

The question is what happens if we are tokenizing some foreign content, and we see an HTML start tag. In the normal case, we pop off all the foreign elements until we get back to the HTML namespace, then reprocess the token. In the fragment case, the context element may be a foreign element, so there was the wrinkle of having to handle that appropriately when we have this fake "root" element that makes everything confusing.

The new text reads:

"If the parser was originally created for the HTML fragment parsing algorithm, then act as described in the "any other start tag" entry below. (fragment case)"

This always just adds the HTML element in place inside the foreign content, even if the fragment context element *is* a HTML element!

This can't be right, as it means parsing document.body.innerHTML will behave totally differently to parsing , for no reason.

Looking back a couple of years, this section of the spec seems to be drifting in a random walk away from reality. We can study this further and try suggesting some text based on what we have implemented so far.

Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com
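The difference described in this message, between the normal case and the new fragment-case text, can be sketched roughly as follows. This is a simplified model and not the spec's wording: `Element`, `BREAKOUT_TAGS` (only a subset of the real list) and `parent_for_html_start_tag` are hypothetical names introduced here, and the fragment branch follows the description above that the element is simply added in place.

```python
from dataclasses import dataclass

HTML_NS, SVG_NS = "html", "svg"

# A subset of the start tags that trigger the breakout; the spec has the full list.
BREAKOUT_TAGS = {"b", "big", "blockquote", "body", "br", "center", "code", "p"}


@dataclass
class Element:
    name: str
    namespace: str
    integration_point: bool = False  # MathML text or HTML integration point


def parent_for_html_start_tag(stack, tag, fragment_case):
    """Return the open element that would receive the new element."""
    if tag not in BREAKOUT_TAGS or fragment_case:
        # "Any other start tag" behaviour: insert in place, inside the
        # current (possibly foreign) node, without popping anything.
        return stack[-1]
    # Normal case: pop foreign elements until the current node is an
    # integration point or an element in the HTML namespace.
    while not (stack[-1].namespace == HTML_NS or stack[-1].integration_point):
        stack.pop()
    # The token would now be reprocessed by the HTML insertion mode.
    return stack[-1]


def open_elements():
    # html > body > svg > g
    return [Element("html", HTML_NS), Element("body", HTML_NS),
            Element("svg", SVG_NS), Element("g", SVG_NS)]


print(parent_for_html_start_tag(open_elements(), "p", fragment_case=False).name)
print(parent_for_html_start_tag(open_elements(), "p", fragment_case=True).name)
```

With the same open elements, the normal case ends up back at the HTML `body`, while the fragment case keeps the new element inside the innermost foreign element. That difference is what this message objects to when the context element is itself an HTML element.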
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
On Thu, 18 Apr 2013, Michael Day wrote:
> Another issue regarding recent changes to 12.2.5.5 "The rules for parsing tokens in foreign content".
>
> When a HTML start tag is seen (specifically "b", "big", "blockquote", "body", "br", "center", "code", ...) the following procedure is given to recover from the parse error:
>
> """
> If the stack of open elements does not have an element in scope that is a MathML text integration point, an HTML integration point, or an element in the HTML namespace, or if the stack of open elements has only one element, then process the token using the rules for the "in body" insertion mode. (fragment case)
> """
>
> Since the stack of open elements always has at the top of the stack, the "element in scope" algorithm will always find it, and as a result, the first part of the condition will always fail.

I ended up removing this from the spec for other reasons, so this should be resolved now. Let me know if it's not. (No, I don't know what I had originally intended.)

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
Hi Adam,

> > Since the stack of open elements always has at the top of the stack, the "element in scope" algorithm will always find it, and as a result, the first part of the condition will always fail.
>
> Even in the fragment case? (Note the parenthetical remark in the spec about this text applying only in the fragment case.)

Yes, see 12.4, the stack of open elements always contains a root in the fragment case when there is a context element:

   Let root be a new html element with no attributes.
   ...
   Set up the parser's stack of open elements so that it contains just the single element root.

Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com
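The point about the root element can be checked with a rough model of the "element in scope" walk. The sketch below makes several assumptions: the spec defines the algorithm for a specific target, whereas `has_in_scope` here takes an arbitrary predicate; `Node`, `SCOPE_MARKERS` (a small subset) and `is_html_or_integration_point` are hypothetical names; and the stack is set up as in the fragment case, starting from a single root html element.

```python
from typing import Callable, List, NamedTuple

HTML_NS, SVG_NS = "html", "svg"


class Node(NamedTuple):
    name: str
    namespace: str
    integration_point: bool = False


# A small subset of the elements that terminate the scope walk.
SCOPE_MARKERS = {(HTML_NS, "html"), (HTML_NS, "table"), (SVG_NS, "foreignObject")}


def has_in_scope(stack: List[Node], matches: Callable[[Node], bool]) -> bool:
    # Walk from the current node towards the root, testing for a match
    # before testing for a scope marker.
    for node in reversed(stack):
        if matches(node):
            return True
        if (node.namespace, node.name) in SCOPE_MARKERS:
            return False
    return False


def is_html_or_integration_point(node: Node) -> bool:
    return node.namespace == HTML_NS or node.integration_point


# Fragment case: the stack starts as just the root html element, and
# foreign elements are pushed on top of it while parsing.
root = Node("html", HTML_NS)
stack = [root, Node("g", SVG_NS), Node("path", SVG_NS)]

# The root is itself an element in the HTML namespace, so the walk finds it
# and "does not have an element in scope" is false.
print(has_in_scope(stack, is_html_or_integration_point))
```

Under these assumptions the first half of the quoted condition can never be true, which leaves only the "stack of open elements has only one element" clause to trigger the in-body handling.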
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
On Thu, Apr 18, 2013 at 12:27 AM, Michael Day wrote:
> Another issue regarding recent changes to 12.2.5.5 "The rules for parsing tokens in foreign content".
>
> When a HTML start tag is seen (specifically "b", "big", "blockquote", "body", "br", "center", "code", ...) the following procedure is given to recover from the parse error:
>
> """
> If the stack of open elements does not have an element in scope that is a MathML text integration point, an HTML integration point, or an element in the HTML namespace, or if the stack of open elements has only one element, then process the token using the rules for the "in body" insertion mode. (fragment case)
> """
>
> Since the stack of open elements always has at the top of the stack, the "element in scope" algorithm will always find it, and as a result, the first part of the condition will always fail.

Even in the fragment case? (Note the parenthetical remark in the spec about this text applying only in the fragment case.)

Adam

> This seems unintentional, and depends upon the exact way in which the "element in scope" algorithm is defined.
>
> Perhaps rewriting this paragraph without reference to the "element in scope" algorithm would make the intent clearer? For example:
>
> If the stack of open elements does not have any elements that are MathML text integration points, or HTML integration points, or that are in the HTML namespace, or if the stack of open elements has only one element ...
>
> Any thoughts?
>
> Best regards,
>
> Michael