[whatwg] Another bug in the HTML parsing spec?

David Flanagan Mon, 17 Oct 2011 16:44:17 -0700

In the HTML spec, "The rules for parsing tokens in foreign content" include an algorithm for "any other end tag". This is the algorithm at the very end of http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html.

I think there are some problems with this algorithm and would appreciate any insight anyone has:

1) Step 3 includes an instruction to jump to the last step in the list of steps. But the last step begins "Otherwise", which sounds like it is an else clause. Jumping into an else clause is confusing enough that I wonder if there is an error in the algorithm wording.

2) I can't get all of the parser tests from html5lib to pass with this algorithm as it is currently written. In particular, there are 5 tests in testdata/tree-construction/tests9.dat of this basic form:


<!DOCTYPE html><body><table><math><mi>foo</mi></math></table>

As the spec is written, the <mi> tag is a text integration point, so the "foo" text token is handled like regular content, not like foreign content. And since it is in a table, it isn't inserted right away but is stored as pending table text. Then, when the </mi> tag is processed, it is processed as foreign content, going through the algorithm I'm talking about here. That pops it off the stack, and then reprocesses the </mi> tag as regular content. This causes the pending table text to be inserted, but since the <mi> has already been popped off the stack, the text gets inserted into the <math> element instead of the <mi> element.

The workaround I've found (I'm not confident that this is the correct workaround) is to change step 3 of the algorithm so that it only pops the stack if there is no pending table text. Another potential workaround is to use the existence of pending table text as a condition for sending tokens to the regular insertion mode rather than treating them as foreign content.
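
The scenario in point 2 and the first workaround can be traced with a toy model. This is only a sketch of the behavior as described above, not the spec's algorithm: the stack of open elements is a plain list of tag names, handle_mi_end_tag and its guard_on_pending_text flag are hypothetical names, and flushing the pending table text is reduced to "insert into whatever the current node is".

```python
def handle_mi_end_tag(guard_on_pending_text):
    """Return the element that ends up receiving the pending "foo" text."""
    # State just before </mi> in <table><math><mi>foo</mi></math></table>.
    stack = ["html", "body", "table", "math", "mi"]
    pending_table_text = ["foo"]

    # Foreign-content "any other end tag", step 3 as described above:
    # pop the matching node. The workaround skips the pop while text is pending.
    skip_pop = guard_on_pending_text and bool(pending_table_text)
    if stack[-1] == "mi" and not skip_pop:
        stack.pop()

    # Reprocessing </mi> as regular content flushes the pending text
    # into whatever the current node is at that moment.
    text_parent = stack[-1]
    pending_table_text.clear()

    # Regular end-tag handling then closes <mi> if it is still open.
    if stack[-1] == "mi":
        stack.pop()
    return text_parent


print(handle_mi_end_tag(guard_on_pending_text=False))  # prints "math"
print(handle_mi_end_tag(guard_on_pending_text=True))   # prints "mi"
```

With the guard off, the text lands in <math>, which is the failure seen in tests9.dat; with the guard on, it lands in <mi>.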

3) In this set of tests http://code.google.com/p/html5lib/source/browse/testdata/tree-construction/webkit01.dat there is this test:


<math><mrow><mrow><mn>1</mn></mrow><mi>a</mi></mrow></math>

When the first </mrow> tag is parsed, it is handled as foreign content, and gets popped off the stack in step 3. Then, the token is reprocessed in body mode. It is treated in the "any other end tag" case. Since the top of the stack happens to be another mrow tag, that one gets popped too. (Other tests don't fail here because they don't happen to have two of the same tags on the stack). This means that the <mi> element ends up as a child of the <math> element instead of the outer <mrow> element.
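
The double pop in point 3 can be reproduced with a similar toy model. Again this is a sketch under assumptions rather than the spec text: both function names are hypothetical, the stack holds bare tag names, and the body-mode "any other end tag" handling is simplified to "pop up to and including the nearest matching node", ignoring the special-element check.

```python
def body_any_other_end_tag(stack, tag):
    """Simplified body-mode handling: pop up to and including the nearest match."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == tag:
            del stack[i:]
            return


def foreign_any_other_end_tag(stack, tag):
    """Step 3 as described above: pop the matching current node, then reprocess."""
    if stack[-1] == tag:
        stack.pop()
    body_any_other_end_tag(stack, tag)


# State just before the first </mrow>; <mn> has already been closed.
stack = ["html", "body", "math", "mrow", "mrow"]
foreign_any_other_end_tag(stack, "mrow")
print(stack)  # prints ['html', 'body', 'math']
```

Both mrow entries are gone after a single end tag, so the following <mi> would be appended to <math>.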


    David Flanagan
