Interesting. I wouldn't put money on how long it'll be before you can actually rely on this algorithm. But I get your point. Is there a reference implementation of it?
:Marco

On Wed, Oct 6, 2010 at 2:54 PM, Edward O'Connor <[email protected]> wrote:

> Hi,
>
> [Taken off-list as this isn't really node-specific anymore.]
>
> > @Edward, the html parser in libxml2 is very good. In some preliminary
> > tests, I've done, it does pretty well even with crappy markup.
>
> Fundamentally, I'm interested in DOM consistency. Given the same
> sequence of bytes, does the libxml2 HTML parser generate the same DOM
> that the major browsers do?
>
> > When you say "browser-compatible"
>
> When I say "browser-compatible," I mean
>
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#parsing
>
> > that doesn't mean much because each browser has their own parser, and
> > when you dig in you'll find that there is quite a bit of difference
> > between them.
>
> All four browser engines are converging on the same parsing algorithm,
> linked above. Which means that, going forward, all five major browsers
> will produce the same DOM from the same arbitrary-pile-of-bytes that
> passes for HTML on the web.
>
> Which means that there's really no reason for people to implement or use
> other tag soup parsing algorithms.
>
>
> Ted
>

--
Marco Rogers
[email protected]

Life is ten percent what happens to you and ninety percent how you respond to it.
- Lou Holtz
