Hi!
I just ran into an issue with stand-alone less-than characters when a
document is parsed by libxml2's HTML parser module. Suppose your HTML
document contains text like:
a < b
The less-than sign in this case is interpreted by the HTML parser module as
the start of a tag, causing the subsequent text (in this case "< b") to be
dropped. Raw less-than signs like this are not well-formed HTML; in
practice, however, they often occur in the text sections of HTML files, and
browsers cope with them.
If it is welcome, I would provide a patch to address this issue. My
suggestion: if the character immediately following the less-than character
is one of (' ' | \n | \r | \t | 0 | '='), then the token is interpreted as
text, not as an element. Relevant code section: HTMLparser.c ->
htmlParseContent()
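
To make the first suggestion concrete, here is a minimal stand-alone sketch of
the character check. It is not the actual patch and does not touch
libxml2's internals: lt_starts_text() is a hypothetical helper name, and the
sketch assumes a NUL-terminated input buffer, so the "0" case above also
covers end of input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a libxml2 function). `p` points at a '<' in a
 * NUL-terminated buffer. Returns true if that '<' should be emitted as
 * character data instead of being treated as the start of an element. */
static bool lt_starts_text(const char *p)
{
    assert(p != NULL && *p == '<');

    /* Look only at the single character right after the '<'. */
    switch (p[1]) {
    case ' ':
    case '\n':
    case '\r':
    case '\t':
    case '\0':   /* end of the (NUL-terminated) input */
    case '=':
        return true;
    default:
        return false;
    }
}
```

With this check, "a < b" and "a <= b" keep their less-than sign as text, while
"<b>" and "</p>" are still handed to the tag parser.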
Another option would be to restore the original read position if
htmlParseHTMLName() fails. Currently the parser drops the entire supposed
element in that case.
Relevant code section: HTMLparser.c -> htmlParseStartTag().
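
The second option, saving the read position and rewinding when no element
name follows the '<', might look roughly like the toy example below. This is
only an illustration of the idea: the cursor type, parse_name() and
try_start_tag() are all made up for this sketch, and the real parser keeps
its input position in its own context structures, which are not modelled
here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Toy stand-in for the parser's input state (an assumption for this
 * sketch, not libxml2's real data structure). */
typedef struct {
    const char *cur;   /* current read position in a NUL-terminated buffer */
} cursor;

/* Hypothetical, simplified name parser: consumes [A-Za-z][A-Za-z0-9]* and
 * returns the number of characters consumed, or 0 if no name starts here. */
static size_t parse_name(cursor *c)
{
    const char *start = c->cur;

    if (!isalpha((unsigned char)*c->cur))
        return 0;
    while (isalnum((unsigned char)*c->cur))
        c->cur++;
    return (size_t)(c->cur - start);
}

/* Tries to consume "<name". Returns 1 on success. If no name follows the
 * '<', the read position is rewound to the '<' and 0 is returned, so the
 * caller can emit the '<' as text instead of dropping input. */
static int try_start_tag(cursor *c)
{
    const char *saved = c->cur;   /* remember where the '<' was */

    assert(*c->cur == '<');
    c->cur++;                     /* skip the '<' */

    if (parse_name(c) == 0) {
        c->cur = saved;           /* restore the original read position */
        return 0;
    }
    return 1;
}
```

For the input "< b", try_start_tag() returns 0 and leaves the cursor on the
'<', so nothing is lost; for "<b>" it returns 1 with the cursor on the '>'.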
Best regards,
Christian Schoenebeck