In the HTML spec, "The rules for parsing tokens in foreign content" include an algorithm for "any other end tag". This is the algorithm at the very end of http://www.whatwg.org/specs/web-apps/current-work/multipage/tree-construction.html.

I think there are some problems with this algorithm and would appreciate any insight anyone has:

1) Step 3 includes an instruction to jump to the last step in the list of steps. But the last step begins "Otherwise", which sounds like it is an else clause. Jumping into an else clause is confusing enough that I wonder if there is an error in the algorithm wording.

2) I can't get all of the parser tests from html5lib to pass with this algorithm as it is currently written. In particular, there are 5 tests in testdata/tree-construction/tests9.dat of this basic form:

<!DOCTYPE html><body><table><math><mi>foo</mi></math></table>

As the spec is written, the <mi> element is a text integration point, so the "foo" text token is handled like regular content, not like foreign content. And since the parser is in a table, the text isn't inserted right away but is stored as pending table text. Then, when the </mi> tag is processed, it is processed as foreign content, going through the algorithm I'm talking about here. That pops the <mi> element off the stack and then reprocesses the </mi> tag as regular content. This causes the pending table text to be inserted, but since the <mi> has already been popped off the stack, the text gets inserted into the <math> element instead of the <mi> element.

The workaround I've found (I'm not confident that this is the correct workaround) is to change step 3 of the algorithm so that it only pops the stack if there is no pending table text. Another potential workaround is to use the existence of pending table text as a condition for sending tokens to the regular insertion mode rather than treating them as foreign content.
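To make the sequence concrete, here is a toy Python model of it. This is only a sketch of the behaviour as I read it, not the spec's algorithm: the Node class, the run function and the guard_on_pending_text flag are all names made up for illustration, and everything apart from the stack pop and the pending-text flush is left out.

```python
class Node:
    """Bare-bones element: a tag name and a list of children."""
    def __init__(self, name):
        self.name = name
        self.children = []


def run(guard_on_pending_text):
    """Return the name of the element that ends up receiving "foo".

    guard_on_pending_text is a hypothetical switch for the first
    workaround: skip the step 3 pop while table text is still pending.
    """
    math, mi = Node("math"), Node("mi")
    math.children.append(mi)
    stack = [math, mi]            # top of the stack is the last item
    pending_table_text = ["foo"]  # stored, not yet inserted

    # </mi> handled as foreign content: pop the matching current node,
    # unless the guard is on and text is still pending.
    if stack[-1].name == "mi" and not (guard_on_pending_text and pending_table_text):
        stack.pop()

    # The token is then reprocessed as regular content, which flushes
    # the pending text into whatever the current node is now.
    stack[-1].children.append("".join(pending_table_text))
    pending_table_text.clear()
    return stack[-1].name


print(run(False))  # math
print(run(True))   # mi
```

With the guard off, the text lands in <math>, which is the failure the tests show. With the guard on, it lands in <mi>. The model stops there and does not show the later pop of <mi> by the regular end tag handling.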

3) In this set of tests http://code.google.com/p/html5lib/source/browse/testdata/tree-construction/webkit01.dat there is this test:

<math><mrow><mrow><mn>1</mn></mrow><mi>a</mi></mrow></math>

When the first </mrow> tag is parsed, it is handled as foreign content, and the inner <mrow> element gets popped off the stack in step 3. Then the token is reprocessed in body mode, where it falls under the "any other end tag" case. Since the top of the stack happens to be another <mrow> element, that one gets popped too. (Other tests don't fail here because they don't happen to have two elements with the same tag name on the stack.) This means that the <mi> element ends up as a child of the <math> element instead of the outer <mrow> element.
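The double pop can be reproduced with a toy model of the stack of open elements. Again, this is a simplified sketch of my reading, not the spec text: the two function names are hypothetical, elements are represented by their tag names only, and the body-mode function ignores implied end tags and the special-element check.

```python
def body_any_other_end_tag(stack, name):
    """Toy body-mode "any other end tag": walk down from the top of the
    stack and pop everything up to and including the first match."""
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == name:
            del stack[i:]
            return


def foreign_end_tag(stack, name):
    """Toy version of the foreign-content step as I read it: pop the
    current node if it matches, then reprocess the same token."""
    if stack and stack[-1] == name:
        stack.pop()
    body_any_other_end_tag(stack, name)


# Stack just before the first </mrow> in the webkit01.dat test
# (the <mn> element has already been closed).
stack = ["html", "body", "math", "mrow", "mrow"]
foreign_end_tag(stack, "mrow")
print(stack)  # ['html', 'body', 'math']
```

Both <mrow> entries are gone after a single </mrow> token, so the following <mi> would be inserted under <math>.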

    David Flanagan
