On Mon, Feb 12, 2007 at 07:42:13AM -0500, Elliotte Harold wrote:
> How robust is
>
> xmllint --html --xmlout
>
> Is it possible to confuse it so badly it won't continue or will generate
> ill-formed markup? Or will it keep on trucking no matter what?

The HTML parser will generate an in-memory tree, no matter what. The tree
may be bizarre from an XML perspective as a result. The XML serializer
doesn't try to detect error conditions, though in the past we have fixed
some cases where the two options together generated non-well-formed XML.

> How does the HTML parser handle bogons (unrecognized elements)? Are they
> treated as empty or dropped or something else?

The HTML parser will try to preserve as much data as possible in the case
of errors.

> How good an alternative is this for TagSoup and Tidy?

I would have to understand TagSoup and Tidy internals to answer this, so I
can't. The point is that the libxml2 HTML parser won't really try to 'fix'
the input; it will raise error messages when facing things it doesn't
understand. Most of the policies about how to correct problems are IMHO
dependent on the use case, and there is the tree API to fix things
according to your needs.

> I'm working on a book about converting messy old HTML to clean XHTML,
> and I'm trying to decide exactly how much of each tool to recommend when.

The libxml2 HTML parser has been used for many real-world tools, like HTML
indexers. It will consume almost anything, but it doesn't try to add many
correcting recipes on top of that. This was discussed on the list a couple
of years ago, and that's where the libxml2 HTML parsing error-handling
principles were set up.

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
                     | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
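
For readers who want to try the behaviour discussed in this thread, here is a minimal sketch of the same idea as `xmllint --html --xmlout`: parse messy HTML leniently, collect the parser's error messages, and serialize the tree as XML. It assumes the third-party Python binding `lxml` (which wraps libxml2) is installed. The sample input and the helper name `html_to_xml` are made up for illustration and are not from the thread.

```python
from lxml import etree  # assumption: lxml, a Python binding to libxml2, is installed

# Hypothetical messy input: an unclosed <p>, an unrecognized <bogon>
# element, and a misnested </b></i> pair.
MESSY_HTML = b"<html><body><p>one<bogon>two</bogon><p><b><i>three</b></i></body>"


def html_to_xml(data):
    """Parse HTML leniently and serialize the resulting tree as XML.

    Returns the serialized bytes and the list of messages the HTML
    parser reported while building the tree.
    """
    # recover=True is lxml's default for HTMLParser; spelled out for clarity.
    parser = etree.HTMLParser(recover=True)
    root = etree.fromstring(data, parser)
    errors = [entry.message for entry in parser.error_log]
    # method="xml" uses the XML serializer on the HTML-parsed tree.
    xml_bytes = etree.tostring(root, method="xml")
    return xml_bytes, errors


xml_bytes, errors = html_to_xml(MESSY_HTML)
print(xml_bytes.decode())
for message in errors:
    print("parser reported:", message)
```

The parser's messages are only reported here, not acted on. Any policy for correcting the tree would be applied to `root` through the tree API before serializing, depending on the use case.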
