On Wed, Oct 06, 2010 at 02:19:32PM -0700, David Gatwood wrote:
> On Oct 6, 2010, at 10:08 AM, [email protected] wrote:
>
> > On Wed, Oct 6, 2010 at 12:18 AM, Steven Falken wrote:
> >> Hi,
> >> I'm trying to parse bare.txt (attached, yes it is simply cnn.com). For
> >> this purpose I'm using parse.c (also attached).
> >> The output is output.txt (Attachment!).
> >> If you look at bare.txt, you see a <script> block from line 826 to
> >> line 886. Now if you look at output.txt, you see the
> >> <script>-Tag in line 759, but the end-Tag (</script>) is in line 784;
> >> the problem is, that this end-Tag is in the middle
> >> of the javascript-code, which is actually bad :(
> >
> > This is because cnn's HTML sucks :). They can't seem to make up their
> > mind between HTML and XHTML.
> >
> > Take a look at line 792 of output.txt: the for statement is mangled.
> > Looks like the '<' operator was interpreted by libxml as a start tag.
> > The </script> is in the place where a </a> is in bare.txt
> >
> > Perhaps libxml2 betrayed its true nature (an XML parser) and parsed
> > bare.txt as XML (XHTML). In this case the content of <script> is also
> > parsed as, and must be valid XML (which it isn't).
> > See http://javascript.about.com/library/blxhtml.htm
>
> Alternatively, this is yet another reason why inline JavaScript should be
> avoided if at all possible. Use the src, Luke.

The HTML specification says where the content of a <script> element
ends, and the libxml2 HTML parser follows that recommendation. A number
of HTML generators fail to emit scripts that respect it, though. Try the
HTML_PARSE_RECOVER option to parse such documents.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
[email protected]  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
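
To make the rule concrete: HTML 4.01 treats the content of <script> as
raw text that ends at the first "</" followed by a letter, whatever tag
name comes next. That is why a "</a>" inside a JavaScript string can
close the script early. The sketch below only illustrates that rule and
is not libxml2's code; script_content_end() and starts_with_name() are
hypothetical names, and the lenient branch is an assumption about how a
recovering parser might behave (stop only at the matching end tag).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test: does `s` start with the tag name `name`? */
static int starts_with_name(const char *s, const char *name)
{
    for (; *name != '\0'; s++, name++) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*name))
            return 0;   /* also hit when `s` ends before `name` does */
    }
    return 1;
}

/*
 * Return the offset in `text` where the raw content of a <script>
 * element ends, or strlen(text) if no terminator is found.
 *
 * recover == 0: HTML 4.01 rule, stop at the first "</" + letter.
 * recover != 0: lenient rule (assumed), stop only at "</" + `name`.
 */
size_t script_content_end(const char *text, const char *name, int recover)
{
    size_t i;

    assert(text != NULL && name != NULL);

    for (i = 0; text[i] != '\0'; i++) {
        if (text[i] != '<' || text[i + 1] != '/')
            continue;               /* a bare '<' operator is just text */
        if (recover) {
            if (starts_with_name(text + i + 2, name))
                break;              /* only the matching end tag closes */
        } else if (isalpha((unsigned char)text[i + 2])) {
            break;                  /* any end tag closes the element */
        }
    }
    return i;
}
```

With the input "w('<b>x</b>');</script>", the strict rule stops at the
"</b>" inside the string literal, while the lenient rule runs on to the
real "</script>". Authors who must keep scripts inline can avoid the
problem by writing "<\/" instead of "</" inside JavaScript strings, as
HTML 4.01 itself suggests.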
