Re: [whatwg] Another bug in the HTML parsing spec?

David Flanagan Tue, 18 Oct 2011 11:29:11 -0700

On 10/17/11 5:47 PM, Ian Hickson wrote:

On Mon, 17 Oct 2011, David Flanagan wrote:

In the HTML spec, "The rules for parsing tokens in foreign content"
include an algorithm for "any other end tag".  This is the algorithm at
the very end of
http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html.


I think there are some problems with this algorithm and would appreciate
any insight anyone has:

1) Step 3 includes an instruction to jump to the last step in the list
of steps.  But the last step begins "Otherwise", which sounds like it is
an else clause.  Jumping into an else clause is confusing enough that I
wonder if there is an error in the algorithm wording.

Yeah, that's bogus. The "last step" it's referring to has been removed (it
used to reset the insertion mode). I've fixed the spec.

Thanks.  With that change, my problem #3 below goes away, as you suspected.

2) I can't get all of the parser tests from html5lib to pass with this
algorithm as it is currently written.  In particular, there are 5 tests in
testdata/tree-construction/tests9.dat of this basic form:

<!DOCTYPE html><body><table><math><mi>foo</mi></math></table>

As the spec is written, the <mi> tag is a text integration point, so the "foo"
text token is handled like regular content, not like foreign content.

Oh, my, yeah, that's all kinds of wrong. The text node should be handled
as if it was in the "in body" mode, not as if it was "in table". I'll have
to study this closer.

I think this broke when we moved away from using an insertion mode for
foreign content.

Here's my current workaround:

In 13.2.5, in the rules for whether to use the current insertion mode or
to insert the token as foreign content, if the token is being inserted
because the current node is a math (or HTML, but I'm not sure about
that) integration point, then first set a text_integration_mode flag,
then invoke the current insertion mode, then clear the flag.

And in the in table insertion mode, when a character token is inserted,
and the text_integration_mode flag is set, then just process the token
using in body mode, and otherwise follow the directions that are there now.

I'm not sure that is the best way to fix the spec, but it works for me,
in the sense that my parser now passes the tests.
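
The workaround described above can be sketched roughly as follows. This is
only a toy model, not the spec's algorithm or any real parser's code: the
Node and ToyParser classes, their method names, and the string-valued
insertion mode are all hypothetical, and only character tokens and the
"in table" / "in body" modes are modelled. The text_integration_mode flag
is the one named in the workaround.

```python
from dataclasses import dataclass, field

# MathML elements that act as text integration points.
MATHML_TEXT_INTEGRATION_POINTS = {"mi", "mo", "mn", "ms", "mtext"}


@dataclass
class Node:
    name: str
    namespace: str = "html"
    text: str = ""


@dataclass
class ToyParser:
    # Hypothetical, heavily simplified parser state.
    stack: list = field(default_factory=list)
    insertion_mode: str = "in table"
    text_integration_mode: bool = False
    pending_table_text: list = field(default_factory=list)

    def current_node(self):
        return self.stack[-1]

    def dispatch_character(self, ch):
        """Decide between the current insertion mode and foreign content."""
        node = self.current_node()
        at_text_integration_point = (
            node.namespace == "mathml"
            and node.name in MATHML_TEXT_INTEGRATION_POINTS
        )
        if at_text_integration_point:
            # Workaround: set the flag, invoke the current mode, clear the flag.
            self.text_integration_mode = True
            try:
                self.process_in_current_mode(ch)
            finally:
                self.text_integration_mode = False
        elif node.namespace == "html":
            self.process_in_current_mode(ch)
        else:
            # Foreign content: a character is inserted directly.
            self.insert_character(ch)

    def process_in_current_mode(self, ch):
        if self.insertion_mode == "in table":
            self.in_table_character(ch)
        else:
            self.in_body_character(ch)

    def in_table_character(self, ch):
        if self.text_integration_mode:
            # Workaround: behave as "in body" instead of buffering table text.
            self.in_body_character(ch)
        else:
            self.pending_table_text.append(ch)

    def in_body_character(self, ch):
        self.insert_character(ch)

    def insert_character(self, ch):
        self.current_node().text += ch


# State after <table><math><mi> has been parsed in the test case above.
mi = Node("mi", "mathml")
parser = ToyParser(stack=[Node("html"), Node("body"), Node("table"),
                          Node("math", "mathml"), mi])
for ch in "foo":
    parser.dispatch_character(ch)
print(mi.text, parser.pending_table_text)
```

In this model the "foo" characters end up in the mi element and nothing is
buffered as pending table text, which is the outcome the tests9.dat cases
expect.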


    David

Henri, do you know how Gecko gets this right currently?

The workaround I've found (I'm not confident that this is the correct
workaround) is to change step 3 of the algorithm so that it only pops
the stack if there is no pending table text.  Another potential
workaround is to use the existence of pending table text as a condition
for sending tokens to the regular insertion mode rather than treating
them as foreign content.

We shouldn't be ending up with pending table text here at all. It should
go straight into the mi element.

3) In this set of tests
http://code.google.com/p/html5lib/source/browse/testdata/tree-construction/webkit01.dat
there is this test:

<math><mrow><mrow><mn>1</mn></mrow><mi>a</mi></mrow></math>

When the first </mrow> tag is parsed, it is handled as foreign content,
and gets popped off the stack in step 3. Then, the token is reprocessed
in body mode.  It is treated in the "any other end tag" case.  Since the
top of the stack happens to be another mrow tag, that one gets popped
too.  (Other tests don't fail here because they don't happen to have two
of the same tags on the stack).  This means that the <mi> element ends
up as a child of the <math> element instead of the outer <mrow> element.
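
The double pop described here can be reproduced with a very small model of
the stack of open elements. This is a sketch under stated assumptions: the
stack holds bare tag names, both functions are hypothetical simplifications
(the real "any other end tag" rules also check for special elements, and the
foreign-content algorithm walks the stack rather than looking only at the
current node), and the reprocess_in_body switch stands in for the old
behaviour of reprocessing the token after step 3. Treating the corrected
behaviour as "stop after popping" is an assumption about the fix.

```python
def in_body_any_other_end_tag(stack, tag):
    """Simplified "in body" rule: pop through the nearest element named tag."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == tag:
            del stack[i:]
            return


def foreign_end_tag(stack, tag, reprocess_in_body):
    """Simplified step 3: pop the current node if it matches the end tag."""
    if stack[-1] == tag:
        stack.pop()
        if reprocess_in_body:
            # Old behaviour: the same token is processed a second time.
            in_body_any_other_end_tag(stack, tag)


def parent_of_mi(reprocess_in_body):
    # Stack just before the first </mrow>; the <mn> has already been closed.
    stack = ["math", "mrow", "mrow"]
    foreign_end_tag(stack, "mrow", reprocess_in_body)
    # The <mi> start tag that follows is inserted into the current node.
    return stack[-1]


print(parent_of_mi(True))   # reprocessing pops the outer mrow as well
print(parent_of_mi(False))  # stopping after the pop keeps the outer mrow
```

With reprocessing, the outer mrow is popped too and the mi element lands
under math; without it, mi is inserted under the outer mrow as the test
expects.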

That should be fixed with the updated spec text now, right?
