When the tokenization state machine is defined, every state first
"consumes" and then potentially "emits". Some states transfer to
another state with an order to "re-consume the character in the next
state". This means that what you do in the new state depends on what
you did in the previous state, and that "consume" is necessarily an
inconsistent operation. A much better wording would be "look at the
next character" and, on state transition, "consume and emit" or just
"emit without consumption", making it clear when the input cursor moves.
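To illustrate, here is a minimal sketch of that wording in Python. The class and method names (peek/consume) are my own, not the spec's, and the states are trivialized to one decision:

```python
# Sketch of a tokenizer where the cursor only moves on an explicit
# "consume"; "peek" never moves it, so states need no "re-consume".
class Tokenizer:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        # Look at the next character without moving the cursor.
        return self.text[self.pos] if self.pos < len(self.text) else None

    def consume(self):
        # Move the cursor past the character peek() would return.
        c = self.peek()
        self.pos += 1
        return c

    def tokenize(self):
        tokens = []
        while (c := self.peek()) is not None:
            if c == "<":
                # Consume and emit on the state transition.
                tokens.append(("tag-open", self.consume()))
            else:
                tokens.append(("char", self.consume()))
        return tokens
```

For example, Tokenizer("<p>").tokenize() yields ("tag-open", "<"), ("char", "p"), ("char", ">"); at no point does a state have to undo what the previous state did.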

I did the same in Twintsam with PeekChar/PeekChars and EatChar/EatChars methods.
http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs
(beware, Twintsam hasn't been updated since January so it's not in
sync with the spec as it is now)

though actually you could just use a character queue onto which you
push back characters that need to be "re-consumed" (i.e. you
"un-read" the character and then you switch to the other state).
This is what html5lib does:
http://html5lib.googlecode.com/svn/trunk/python/src/tokenizer.py
(search for self.stream.queue; this needs to be refactored with an
unread() method on the HTMLInputStream)
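Sketched in Python, the push-back approach looks like this; InputStream and its read()/unread() names are illustrative, not html5lib's actual API:

```python
from collections import deque

# "Re-consume" becomes unread() on the input stream, so every state
# can consume unconditionally and push back what it doesn't want.
class InputStream:
    def __init__(self, text):
        self.chars = iter(text)
        self.queue = deque()  # characters pushed back for re-reading

    def read(self):
        # Pushed-back characters come first, then fresh input.
        if self.queue:
            return self.queue.popleft()
        return next(self.chars, None)

    def unread(self, c):
        # Push the character back; the next read() returns it again.
        self.queue.appendleft(c)
```

A state that decides a character belongs to the next state simply calls unread(c) before transitioning, instead of the spec's "re-consume in the next state".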

That is to say, I don't think the spec should be changed at all. It's
just a matter of how you implement it. You just have to know that the
"queue" won't ever be larger than 9 characters, as there are tweaks for
0-prefixed numeric entities and/or numeric entities greater than 1114111.

It would be nice if all <!...> tags (except comments) were considered
"declarations" instead of bogus comments. Then DOCTYPE wouldn't need
special handling by the tokenizer, just special handling by the parser.
(Too much of the parser seems to have gotten into the tokenizer; with
CDATA and RCDATA, this is a necessary evil. With <!DOCTYPE ...> it
isn't.)

I can't see the problem here; plus DOCTYPE parsing is special because
we need the DOCTYPE name.
Moreover, the spec has changed recently so that DOCTYPE parsing takes
care of PUBLIC and SYSTEM identifiers.

--
Thomas Broyer
