Stefan Behnel wrote:
> Hi,
>
> Karl Dubost wrote:
>> Nick Kew weighed in and proposed that we should target [6]libxml
>> which includes an HTML parser and is already supported by Apache
>> server and many other tools.
>>
>> [6] http://xmlsoft.org/html/libxml-HTMLparser.html
>>
>> From here it would be interesting to implement HTML 5 parsing
>> algorithm into libxml2. It would benefit the community as large.
>
> Have you tried joining forces with the people who started the C implementation
> of html5lib? Maybe they have ideas to contribute or (partially) working code
> that you can look at. It may even happen that you get them convinced of the
> project.
>
> In any case, having working implementations in Python and Java should get you
> a lot closer to your goal by looking under the hood.

FWIW, I've spent the summer working on a C HTML5 parser called Hubbub[1],
which is approaching stability. It's about half as fast as libxml2 at
parsing the HTML 5 spec when paired with an O(1) treebuilder, and it's
fairly easy to bind to the libxml2 interfaces (it is being used in lieu of
the libxml2 HTML parser in the development branch of a small Web browser,
NetSurf[2]).

Note that it a) is not buildable as a shared library and b) has not had a
formal release, but if someone wants an HTML5 parser in C, then it's
probably not a bad bet.

[1] http://www.netsurf-browser.org/projects/hubbub/
[2] http://www.netsurf-browser.org/

_______________________________________________
xml mailing list, project page http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
