[whatwg] Potentially avoidable tokeniser/treebuilder dependency

Øistein E. Andersen Tue, 22 Sep 2009 16:01:18 -0700

As currently specified, the tokeniser is mostly, but not completely, independent of the treebuilder.

The major obstacle to an independent tokeniser seems to be that the content model flag is set to RCDATA, RAWTEXT or PLAINTEXT by the treebuilder and not by the tokeniser. In most cases, the new content model flag is entirely predictable from the start tag (and RCDATA/RAWTEXT element names are known to the tokeniser already). The only exceptions I have found so far concern start tags within <select> and <frameset>, which are dropped by the treebuilder and therefore do not cause the content model flag to change. Even these cases could perhaps have been handled by the tokeniser without too much trouble (and without changing the spec) if it were not for the "in select in table" insertion mode, where a missing </select> end tag may be inferred depending on the stack of open elements.

It seems unfortunate to abandon the possibility of an independent tokeniser just to handle what appears to be a corner case of a corner case, viz., unclosed RCDATA/RAWTEXT elements inside an unclosed <select> element in a table. The easiest solution would be to switch the content model flag upon seeing an RCDATA/RAWTEXT/PLAINTEXT start tag irrespective of insertion mode, i.e., also within <select> and <frameset>, which would allow the tokeniser to take care of this without added complexity. Other solutions might be worth considering if this is found to be too incompatible with existing pages. (I could have a look at the http://www.dotnetdotcom.org/ dataset if that would be of any use.)
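
To make the suggestion concrete, here is a rough Python sketch of a tokeniser that switches the content model flag on its own, purely from the start tag name and without consulting any insertion mode. It is only an illustration under stated assumptions, not the spec's algorithm: the element lists, the "PCDATA" name for the default flag, the regular expressions and the names flag_for and tokenise are all made up for this example, and attributes, comments and character references are ignored.

```python
import re

# Assumed element lists, for illustration only; the spec is authoritative.
RCDATA = {"title", "textarea"}
RAWTEXT = {"style", "script", "xmp", "iframe", "noembed", "noframes"}
PLAINTEXT = {"plaintext"}

# Very rough tag pattern (hypothetical): optional "/", a name, anything up to ">".
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def flag_for(name):
    """Predict the content model flag from the start tag name alone."""
    name = name.lower()
    if name in RCDATA:
        return "RCDATA"
    if name in RAWTEXT:
        return "RAWTEXT"
    if name in PLAINTEXT:
        return "PLAINTEXT"
    return "PCDATA"  # assumed name for the default flag


def tokenise(html):
    """Yield (kind, value) pairs; the flag never depends on a treebuilder."""
    pos = 0
    while pos < len(html):
        m = TAG.search(html, pos)
        if m is None:
            yield ("text", html[pos:])
            return
        if m.start() > pos:
            yield ("text", html[pos:m.start()])
        name = m.group(2).lower()
        pos = m.end()
        if m.group(1):  # end tag seen while in the default content model
            yield ("end", name)
            continue
        yield ("start", name)
        flag = flag_for(name)
        if flag == "PLAINTEXT":
            # Everything up to the end of the input is text.
            if pos < len(html):
                yield ("text", html[pos:])
            return
        if flag in ("RCDATA", "RAWTEXT"):
            # Only the matching end tag can terminate the element's text.
            closer = re.compile(r"</" + re.escape(name) + r"\s*>", re.I)
            end = closer.search(html, pos)
            stop = end.start() if end else len(html)
            if stop > pos:
                yield ("text", html[pos:stop])
            if end:
                yield ("end", name)
                pos = end.end()
            else:
                pos = len(html)
```

With this behaviour, the "<b>" in "<select><title><b></title></select>" comes out as text even though the <title> start tag sits inside <select>; whether that would be compatible enough with existing pages is exactly the open question.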

(A tiny bit of context: I recently implemented most of the tokeniser in lex with a view to using it as a tool to investigate the use of named character references in existing documents. It uses about 20 start conditions instead of the spec's 39 states and two flags, is fairly compact and readable (500 lines compared to 5,500 in the Validator.nu implementation), and runs about 35 times faster than the full Validator.nu HTML Parser (both under highly suboptimal conditions). Unfortunately, it is of little use without a treebuilder to set the content model flag. It has been pointed out that use cases for which a tree is not needed may not require perfect tokenisation; even if that be true, it is much more difficult to ensure that an approximate implementation is sufficiently close than to follow the specification; perhaps more importantly, removing unnecessary dependencies and allowing the tokeniser to run on its own would also make it easier to develop and test a tokeniser for use as part of a full parser.)


--
Øistein E. Andersen