[whatwg] Should ambiguous ampersand be a parse error?
HTML5 authors:

The HTML5 spec says that an ambiguous ampersand (e.g. &something; undefined) is not allowed in element content, and in the section on HTML parsing, that this should throw a parse error. However, browsers seem to render an ambiguous ampersand verbatim, which appears to be a good thing to do.

Is the specification intended to have compliant HTML agents stop parsing ambiguous ampersands? I suggest it would be better to amend the specification to say that HTML5 agents should accept an ambiguous ampersand and render the text verbatim (as plain text characters), rather than throwing a parse error.

Is there a historic or technical reason for the specification wanting to treat an ambiguous ampersand as a parse error?

Peter.
Re: [whatwg] Should ambiguous ampersand be a parse error?
On 12/10/13 11:11 AM, Peter Cashin wrote:
> The HTML5 spec says that an ambiguous ampersand (e.g. &something; undefined) is not allowed in element content

Right, that's an authoring requirement.

> and in section on HTML parsing, that this should throw a parse error.

There is no throwing of parse errors in the HTML spec. I assume you're looking at the "anything else" case of http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#consume-a-character-reference ? This says, for the case you're looking at:

  If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.

And if you follow the link to "parse error" it's http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error and basically has to do with validators needing to report them and UAs being allowed (but not required) to stop parsing here if they really want. If they do NOT want to abort on the error (which is the common case, btw), the spec defines how they press on.

And the way they press on is by returning nothing from the "consume a character reference" algorithm. What that does depends on the caller, but in the case you're talking about that's presumably http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#character-reference-in-data-state and what it will do if nothing is returned is emit the '&' and move on to the next character. So it basically treats the '&' as not special in any way in this case, leading to the behavior you observe in browsers.

> Is the specification intended to have compliant HTML agents stop parsing ambiguous ampersands?

Compliant HTML agents are allowed to do so, I guess, per the technical rules about parse errors, just like for any other parse error. But I expect that this is at least partly for conformance classes other than browsers; all browsers press on through parse errors in HTML. Maybe the allowed behavior for parse errors should be made conditional on conformance class...

-Boris
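The recovery path described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the spec's algorithm: the four-entry NAMED_REFERENCES table stands in for the real named character reference table, numeric references and the legacy names that work without a semicolon are ignored, and the function names are hypothetical.

```python
import re

# Assumption: a tiny stand-in for the spec's named character reference table.
NAMED_REFERENCES = {"amp;": "&", "lt;": "<", "gt;": ">", "quot;": '"'}

# One or more ASCII alphanumerics followed by ';', as in the quoted spec text.
AMBIGUOUS = re.compile(r"[A-Za-z0-9]+;")


def consume_character_reference(text, pos):
    """Try to consume a named reference starting at pos (just after an '&').

    Returns (replacement or None, new position, is_parse_error).
    """
    # Prefer the longest matching name.
    for name in sorted(NAMED_REFERENCES, key=len, reverse=True):
        if text.startswith(name, pos):
            return NAMED_REFERENCES[name], pos + len(name), False
    # No match: nothing is consumed and nothing is returned. It is a parse
    # error only if the text looks like "&alphanumerics;".
    is_error = AMBIGUOUS.match(text, pos) is not None
    return None, pos, is_error


def tokenize_data(text):
    """Return (emitted text, offsets of parse errors) without ever aborting."""
    out, errors = [], []
    i = 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        replacement, new_pos, is_error = consume_character_reference(text, i + 1)
        if is_error:
            errors.append(i)  # recorded for a validator; parsing continues
        if replacement is None:
            out.append("&")  # emit the '&' verbatim and move on
            i += 1
        else:
            out.append(replacement)
            i = new_pos
    return "".join(out), errors
```

In this sketch, tokenize_data("a &something; b &amp; c") keeps "&something;" as literal text while recording one parse error at the offset of its ampersand, which matches the browser behavior observed in the original question.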
Re: [whatwg] Should ambiguous ampersand be a parse error?
2013-12-10 19:45, Boris Zbarsky wrote:
> On 12/10/13 11:11 AM, Peter Cashin wrote:
>> The HTML5 spec says that an ambiguous ampersand (e.g. &something; undefined) is not allowed in element content
> Right, that's an authoring requirement.

Authoring requirements as such are just policy statements, therefore regularly ignored. They are supposed to communicate something, but as the late prof. Wiio so wisely stated, communication usually fails, except by accident (and he was an optimist).

> There is no throwing of parse errors in the HTML spec.

Well, yes, throwing belongs to the DOM and to scripting. The question is whether some construct is parsed in a particular way or not.

>> Is the specification intended to have compliant HTML agents stop parsing ambiguous ampersands?
> Compliant HTML agents are allowed to do so, I guess, per the technical rules about parse errors, just like for any other parse error. But I expect that this is at least partly for conformance classes other than browsers; all browsers press on through parse errors in HTML. Maybe the allowed behavior for parse errors should be made conditional on conformance class...

Allowing user agents to stop parsing after a parse error (BTW, where exactly does the WHATWG HTML Living Standard allow that?) is really just avoidance. If browsers actually apply some specific error recovery, what’s the excuse for not making that mandatory? Different user agents can really do very different things. But I don’t think it’s a good idea to make that a rule of *parsing HTML*.

Yucca
Re: [whatwg] Should ambiguous ampersand be a parse error?
On 12/10/13 2:33 PM, Jukka K. Korpela wrote:
> Authoring requirements as such are just policy statements, therefore regularly ignored.

In this case, it's an eminently validator-enforceable authoring requirement.

> Allowing user agents to stop parsing after a parse error (BTW, where exactly does the WHATWG HTML Living Standard allow that?)

Did you try following the links in my mail? Let me try again, but this time do actually follow the link: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error

> If browsers actually apply some specific error recovery, what’s the excuse for not making that mandatory?

For example, it allows a validator or other conformance checker to just stop at the first parse error. In fact, the spec goes to some trouble to allow that and discuss conformance checker behavior around parse errors, if you read the link above.

-Boris
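To make the conformance-checker point concrete, here is a minimal sketch of a checker that may either stop at the first parse error or report all of them. It is an assumption-laden illustration: only the ambiguous-ampersand error is modelled, the three-name KNOWN_NAMES set stands in for the spec's named character reference table, and report_parse_errors and stop_at_first are hypothetical names.

```python
import re

# Assumption: three names stand in for the full named character reference table.
KNOWN_NAMES = {"amp", "lt", "gt"}

# An '&', one or more ASCII alphanumerics, then ';'.
CANDIDATE = re.compile(r"&([A-Za-z0-9]+);")


def report_parse_errors(text, stop_at_first=False):
    """Return offsets of ambiguous ampersands, optionally only the first one."""
    offsets = []
    for match in CANDIDATE.finditer(text):
        if match.group(1) in KNOWN_NAMES:
            continue  # a known reference, not a parse error
        offsets.append(match.start())
        if stop_at_first:
            break  # a checker is allowed to give up after one report
    return offsets
```

Both modes agree on whether the input contains a parse error at all; they differ only in how many errors are reported, which is the latitude the message above attributes to conformance checkers.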
Re: [whatwg] Should ambiguous ampersand be a parse error?
2013-12-10 22:20, Boris Zbarsky wrote:
> In this case, it's an eminently validator-enforceable authoring requirement.

That’s more or less a wannabe-normative requirement that “validators” are supposed to enforce. There is no real HTML5 validator so far (not surprising, as there is no HTML5), but the point is that nobody who does not use a “validator” will see the requirement as “enforced”.

>> Allowing user agents to stop parsing after a parse error (BTW, where exactly does the WHATWG HTML Living Standard allow that?)
> Did you try following the links in my mail? Let me try again, but this time do actually follow the link: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error

“This section only applies to user agents, data mining tools, and conformance checkers.” So what about conformance of documents? If browsers are allowed to quit, or to proceed, then this is a very theoretic proposition. Technically, it does not define document conformance, does it?

Yucca
Re: [whatwg] Should ambiguous ampersand be a parse error?
On 12/10/13 4:41 PM, Jukka K. Korpela wrote:
>>> Allowing user agents to stop parsing after a parse error (BTW, where exactly does the WHATWG HTML Living Standard allow that?)
>> Did you try following the links in my mail? Let me try again, but this time do actually follow the link: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error
> “This section only applies to user agents, data mining tools, and conformance checkers.” So what about conformance of documents?

You asked where the standard says that a user agent can stop after a parse error. That's the section linked above. Conformance of documents is pretty simple: any document with a parse error is non-conformant last I checked, though exactly where it says that varies depending on the syntactic construct.

> If browsers are allowed to quit, or to proceed, then this is a very theoretic proposition.

Conformance of documents is always a theoretic proposition if all possible inputs have defined processing.

-Boris
Re: [whatwg] Should ambiguous ampersand be a parse error?
Boris Zbarsky, Jukka K. Korpela:

Thank you for your responses -- they are much appreciated. Sorry I talked about throwing a parse error; the specification does not say anything like that. It is just that I had thought that a parse error should be quite a serious issue -- but it seems that is not necessarily the case.

You have given me confidence that browsers will continue to treat an ambiguous ampersand as normal text that is rendered verbatim, despite the fact that the specification says this is a parse error.

Thanks again,
Peter.