As currently specified, the tokeniser is mostly, but not completely, independent of the treebuilder.

The major obstacle to an independent tokeniser seems to be that the content model flag is set to RCDATA, RAWTEXT or PLAINTEXT by the treebuilder and not by the tokeniser. In most cases, the new content model flag is entirely predictable from the start tag (and RCDATA/RAWTEXT element names are known to the tokeniser already). The only exceptions I have found so far concern start tags within <select> and <frameset>, which are dropped by the treebuilder and therefore do not cause the content model flag to change. Even these cases could perhaps have been handled by the tokeniser without too much trouble (and without changing the spec) if it were not for the "in select in table" insertion mode, where a missing </select> end tag may be inferred depending on the stack of open elements.

It seems unfortunate to abandon the possibility of an independent tokeniser just to handle what appears to be a corner case of a corner case, viz., unclosed RCDATA/RAWTEXT elements inside an unclosed <select> element in a table. The easiest solution would be to switch the content model flag upon seeing an RCDATA/RAWTEXT/PLAINTEXT start tag irrespective of insertion mode, i.e., also within <select> and <frameset>, which would allow the tokeniser to take care of this without added complexity. Other solutions might be worth considering if this is found to be too incompatible with existing pages. (I could have a look at the http://www.dotnetdotcom.org/ dataset if that would be of any use.)
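To illustrate the easiest solution, here is a minimal Python sketch of a tokeniser that picks the new content model flag from the start tag name alone, without asking the treebuilder. The names CONTENT_MODEL_BY_TAG and content_model_after_start_tag are hypothetical, and the element lists are an assumption made for illustration only; the authoritative lists are those in the spec.

```python
# Illustrative table only (an assumption, not copied from the spec):
# start tag name -> content model flag the tokeniser would switch to.
CONTENT_MODEL_BY_TAG = {
    "title": "RCDATA",
    "textarea": "RCDATA",
    "style": "RAWTEXT",
    "xmp": "RAWTEXT",
    "iframe": "RAWTEXT",
    "noembed": "RAWTEXT",
    "noframes": "RAWTEXT",
    "plaintext": "PLAINTEXT",
}


def content_model_after_start_tag(tag_name, current="PCDATA"):
    """Return the content model flag to use after emitting a start tag.

    The decision depends only on the tag name, never on the insertion
    mode or the stack of open elements, so it also applies inside
    <select> and <frameset>.
    """
    return CONTENT_MODEL_BY_TAG.get(tag_name.lower(), current)
```

With this approach, a <textarea> start tag inside an unclosed <select> would switch the flag to RCDATA even though the treebuilder drops the tag, which is precisely the behavioural change whose compatibility would need checking against existing pages.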

(A tiny bit of context: I recently implemented most of the tokeniser in lex with a view to using it as a tool to investigate the use of named character references in existing documents. It uses about 20 start conditions instead of the spec's 39 states and two flags, is fairly compact and readable (500 lines compared to 5,500 in the Validator.nu implementation), and runs about 35 times faster than the full Validator.nu HTML Parser (both under highly suboptimal conditions). Unfortunately, it is of little use without a treebuilder to set the content model flag. It has been pointed out that use cases for which a tree is not needed may not require perfect tokenisation. Even if that be true, it is much more difficult to ensure that an approximate implementation is sufficiently close than to follow the specification. Perhaps more importantly, removing unnecessary dependencies and allowing the tokeniser to run on its own would also make it easier to develop and test a tokeniser for use as part of a full parser.)

--
Øistein E. Andersen
