Re: [whatwg] Should ambiguous ampersand be a parse error?

2014-01-22 Thread Ian Hickson
On Tue, 10 Dec 2013, Boris Zbarsky wrote:
 On 12/10/13 11:11 AM, Peter Cashin wrote:
  
  Is the specification intended to have compliant HTML agents stop 
  parsing ambiguous ampersands?
 
 Compliant HTML agents are allowed to do so, I guess, per the technical 
 rules about parse errors, just like for any other parse error.  But I 
 expect that this is at least partly for conformance classes other than 
 browsers; all browsers press on through parse errors in HTML.  Maybe 
 the allowed behavior for parse errors should be made conditional on 
 conformance class...

While I agree that it's unlikely that any browser will ever make use of 
this in its default mode, I've still allowed it, because it can be a 
useful mode to use in an authoring or educational environment.


On Tue, 10 Dec 2013, Jukka K. Korpela wrote:
 
 Authoring requirements as such are just policy statements, therefore 
 regularly ignored.

Conformance requirements for authors are really just a way to try to help 
authors avoid making what they would consider mistakes. The specification 
actually has a whole section that explains why we bother to have them:

   http://whatwg.org/html#conformance-requirements-for-authors


 Allowing user agents to stop parsing after a parse error (BTW, where 
 exactly does the WHATWG HTML Living Standard allow that?)

It's in the sentence that follows the one that defines parse error:

   http://whatwg.org/html#parse-error


 is really just avoidance.

Not sure what you mean by avoidance. What does it avoid?


 If browsers actually apply some specific error recovery, what’s the 
 excuse for not making that mandatory?

We allow these two implementation strategies because not all tools 
actually need to recover. For example, an HTML publishing pipeline might 
want to assume that its input is valid, and simply refuse to handle 
invalid input, rather than applying the error handling rules (which can 
cause a big mess, e.g. reordering content!).
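
As a rough illustration of those two strategies (a sketch only, not anything the spec defines): the names `publish`, `InvalidInput` and the `recover` flag below are hypothetical, and the check is limited to semicolon-terminated named references looked up in Python's standard `html.entities.html5` table.

```python
import re
from html.entities import html5  # stdlib table; keys look like "amp;"


class InvalidInput(Exception):
    """Hypothetical error raised by a tool that refuses to recover."""


# Candidate named references: '&', alphanumerics, then ';'.
_REF = re.compile(r"&([0-9A-Za-z]+;)")


def first_unknown_reference(source):
    """Return the offset of the first unknown '&name;', or None."""
    for m in _REF.finditer(source):
        if m.group(1) not in html5:
            return m.start()
    return None


def publish(source, recover):
    """Either press on past the error or abort early, as the caller chooses."""
    offset = first_unknown_reference(source)
    if offset is not None and not recover:
        # Strategy 2: assume valid input and simply refuse anything else.
        raise InvalidInput(f"parse error at offset {offset}")
    # Strategy 1: carry on; the text is left exactly as written.
    return source
```

A browser-like caller would pass `recover=True`; a pipeline that wants to reject invalid input would pass `recover=False` and handle `InvalidInput`.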


 Different user agents can really do very different things. But I don’t 
 think it’s a good idea to make that a rule of *parsing HTML*.

It's not really different things, it's either doing what the spec says, or 
aborting early.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

[whatwg] Should ambiguous ampersand be a parse error?

2013-12-10 Thread Peter Cashin
HTML5 authors:

The HTML5 spec says that an ambiguous ampersand (e.g. &something; undefined) is 
not allowed in element content, and in the section on HTML parsing, that this 
should throw a parse error. 

However, browsers seem to render an ambiguous ampersand verbatim, which appears 
to be a good thing to do.

Is the specification intended to have compliant HTML agents stop parsing 
ambiguous ampersands?

I suggest it would be better to amend the specification to say that HTML5 
agents should accept an ambiguous ampersand and render the text verbatim (as 
plain text characters), rather than throwing a parse error.

Is there a historic or technical reason for the specification wanting to treat 
an ambiguous ampersand as a parse error?

Peter.



Re: [whatwg] Should ambiguous ampersand be a parse error?

2013-12-10 Thread Boris Zbarsky

On 12/10/13 11:11 AM, Peter Cashin wrote:

The HTML5 spec says that an ambiguous ampersand (e.g. &something; undefined) is 
not allowed in element content


Right, that's an authoring requirement.


and in section on HTML parsing, that this should throw a parse error.


There is no throwing of parse errors in the HTML spec.

I assume you're looking at the "anything else" case of 
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#consume-a-character-reference 
?  This says, for the case you're looking at:


  If no match can be made, then no characters are consumed, and nothing
  is returned. In this case, if the characters after the U+0026
  AMPERSAND character (&) consist of a sequence of one or more
  alphanumeric ASCII characters followed by a U+003B SEMICOLON
  character (;), then this is a parse error.

And if you follow the link to parse error it's 
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error 
and basically has to do with validators needing to report them and UAs 
being allowed (but not required) to stop parsing here if they really 
want.  If they do NOT want to abort on the error (which is the common 
case, btw), the spec defines how they press on.


And the way they press on is by returning nothing from the consume a 
character reference algorithm.  What that does depends on the caller, 
but in the case you're talking about that's presumably 
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#character-reference-in-data-state 
and what it will do if nothing is returned is emit the '&' and move on 
to the next character.  So it basically treats the '&' as not special in 
any way in this case, leading to the behavior you observe in browsers.
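
Very roughly, and only as a sketch under simplifying assumptions, that flow could look like the following. The function names are made up for illustration, only semicolon-terminated named references are handled (no numeric references, no legacy names without a semicolon), and the lookup uses Python's standard `html.entities.html5` table rather than the spec's own algorithm.

```python
import re
from html.entities import html5  # stdlib table; keys look like "amp;"

# '&', one or more ASCII alphanumerics, then ';'
_AMBIGUOUS = re.compile(r"&[0-9A-Za-z]+;")


def consume_character_reference(text, pos):
    """Simplified stand-in for the spec algorithm; text[pos] is '&'.

    Returns (replacement, new_pos) on a match, or None when no match
    can be made, in which case nothing is consumed.
    """
    end = text.find(";", pos + 1)
    if end == -1:
        return None
    name = text[pos + 1:end + 1]  # e.g. "amp;"
    if name in html5:
        return html5[name], end + 1
    return None


def tokenize_data(text):
    """Return (output_text, offsets_of_parse_errors)."""
    out, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos] != "&":
            out.append(text[pos])
            pos += 1
            continue
        result = consume_character_reference(text, pos)
        if result is None:
            if _AMBIGUOUS.match(text, pos):
                errors.append(pos)  # noted as a parse error, nothing more
            out.append("&")  # emit the '&' itself and move on
            pos += 1
        else:
            replacement, pos = result
            out.append(replacement)
    return "".join(out), errors
```

With this sketch, `tokenize_data("a &foo; b")` gives back the text unchanged and records one error at the offset of the '&', which is the "render it verbatim" behavior seen in browsers.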



Is the specification intended to have compliant HTML agents stop parsing 
ambiguous ampersands?


Compliant HTML agents are allowed to do so, I guess, per the technical 
rules about parse errors, just like for any other parse error.  But I 
expect that this is at least partly for conformance classes other than 
browsers; all browsers press on through parse errors in HTML.  Maybe 
the allowed behavior for parse errors should be made conditional on 
conformance class...


-Boris


Re: [whatwg] Should ambiguous ampersand be a parse error?

2013-12-10 Thread Jukka K. Korpela

2013-12-10 19:45, Boris Zbarsky wrote:


On 12/10/13 11:11 AM, Peter Cashin wrote:

The HTML5 spec says that an ambiguous ampersand (e.g. &something;
undefined) is not allowed in element content


Right, that's an authoring requirement.


Authoring requirements as such are just policy statements, therefore 
regularly ignored. They are supposed to communicate something, but as 
the late prof. Wiio so wisely stated, communication usually fails, 
except by accident (and he was an optimist).



There is no throwing of parse errors in the HTML spec.


Well, yes, throwing belongs to the DOM and to scripting. The question is 
whether some construct is parsed in a particular way or not.



Is the specification intended to have compliant HTML agents stop
parsing ambiguous ampersands?


Compliant HTML agents are allowed to do so, I guess, per the technical
rules about parse errors, just like for any other parse error.  But I
expect that this is at least partly for conformance classes other than
browsers; all browsers press on through parse errors in HTML.  Maybe
the allowed behavior for parse errors should be made conditional on
conformance class...


Allowing user agents to stop parsing after a parse error (BTW, where 
exactly does the WHATWG HTML Living Standard allow that?) is really just 
avoidance. If browsers actually apply some specific error recovery, 
what’s the excuse for not making that mandatory?


Different user agents can really do very different things. But I don’t 
think it’s a good idea to make that a rule of *parsing HTML*.


Yucca




Re: [whatwg] Should ambiguous ampersand be a parse error?

2013-12-10 Thread Boris Zbarsky

On 12/10/13 2:33 PM, Jukka K. Korpela wrote:

Authoring requirements as such are just policy statements, therefore
regularly ignored.


In this case, it's an eminently validator-enforceable authoring requirement.


Allowing user agents to stop parsing after a parse error (BTW, where
exactly does the WHATWG HTML Living Standard allow that?)


Did you try following the links in my mail?  Let me try again, but this 
time do actually follow the link: 
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error



If browsers actually apply some specific error recovery,
what’s the excuse for not making that mandatory?


For example, it allows a validator or other conformance checker to just 
stop at the first parse error.  In fact, the spec goes to some trouble 
to allow that and discuss conformance checker behavior around parse 
errors, if you read the link above.


-Boris


Re: [whatwg] Should ambiguous ampersand be a parse error?

2013-12-10 Thread Jukka K. Korpela

2013-12-10 22:20, Boris Zbarsky wrote:


In this case, it's an eminently validator-enforceable authoring
requirement.


That’s more or less a wannabe-normative requirement that “validators” 
are supposed to enforce. There is no real HTML5 validator so far (not 
surprising, as there is no HTML5), but the point is that nobody who does 
not use a “validator” will see the requirement as “enforced”.



Allowing user agents to stop parsing after a parse error (BTW, where
exactly does the WHATWG HTML Living Standard allow that?)


Did you try following the links in my mail?  Let me try again, but this
time do actually follow the link:
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error


“This section only applies to user agents, data mining tools, and 
conformance checkers.” So what about conformance of documents?


If browsers are allowed to quit, or to proceed, then this is a very 
theoretic proposition. Technically, it does not define document 
conformance, does it?


Yucca





Re: [whatwg] Should ambiguous ampersand be a parse error?

2013-12-10 Thread Boris Zbarsky

On 12/10/13 4:41 PM, Jukka K. Korpela wrote:

Allowing user agents to stop parsing after a parse error (BTW, where
exactly does the WHATWG HTML Living Standard allow that?)


Did you try following the links in my mail?  Let me try again, but this
time do actually follow the link:
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parse-error



“This section only applies to user agents, data mining tools, and
conformance checkers.” So what about conformance of documents?


You asked where the standard says that a user agent can stop after a 
parse error.  That's the section linked above.


Conformance of documents is pretty simple: any document with a parse 
error is non-conformant last I checked, though exactly where it says 
that varies depending on the syntactic construct.



If browsers are allowed to quit, or to proceed, then this is a very
theoretic proposition.


Conformance of documents is always a theoretic proposition if all 
possible inputs have defined processing.


-Boris


Re: [whatwg] Should ambiguous ampersand be a parse error?

2013-12-10 Thread Peter Cashin
Boris Zbarsky & Jukka K. Korpela:

Thank you for your responses -- they are much appreciated.

Sorry I talked about throwing a parse error; the specification does not say 
anything like that. It is just that I had thought that a parse error should be 
quite a serious issue --  but it seems that is not necessarily the case. 

You have given me confidence that browsers will continue to parse an ambiguous 
ampersand as normal text that is parsed verbatim, despite the fact that the 
specification says this is a parse error.

Thanks again,
Peter.