[xml] less-than character and HTML parser module

Christian Schoenebeck Mon, 13 Apr 2015 16:04:19 -0700

Hi!

I just ran into an issue with stand-alone less-than characters when a 
document is parsed by libxml2's HTML parser module. Suppose your HTML 
document contains text like:


        a < b

The HTML parser module interprets the less-than sign here as the start of a 
tag, which causes the subsequent text (in this case "< b") to be dropped. Raw 
less-than signs like this are not well-formed HTML; in practice, however, 
they often occur in the text sections of HTML files, and browsers cope with 
them.

If that is welcome, I would provide a patch to address this issue. My 
suggestion: if the character following the less-than character is one of
(' ' | \n | \r | \t | 0 | '='), then the token is interpreted as text, not as 
the start of an element. Relevant code section:  HTMLparser.c -> 
htmlParseContent()

Another option would be to restore the original read position if 
htmlParseHTMLName() fails. Currently the parser drops the entire supposed 
element. Relevant code section:  HTMLparser.c -> htmlParseStartTag().
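
The save-and-restore idea could look roughly like the sketch below. It uses a 
toy cursor type instead of libxml2's real parser context, and parse_name() 
and try_start_tag() are hypothetical stand-ins for htmlParseHTMLName() and 
htmlParseStartTag(), with a deliberately simplified notion of a tag name:

```c
#include <ctype.h>
#include <stddef.h>

/* Toy input cursor; libxml2's real parser context is far richer. */
typedef struct {
    const char *buf;   /* NUL-terminated input */
    size_t pos;        /* current read position */
} cursor;

/*
 * Hypothetical stand-in for htmlParseHTMLName(): consume a name of the
 * form [A-Za-z][A-Za-z0-9]* and return its length, or 0 if none is found.
 */
static size_t parse_name(cursor *c)
{
    size_t start = c->pos;

    if (!isalpha((unsigned char)c->buf[c->pos]))
        return 0;
    while (isalnum((unsigned char)c->buf[c->pos]))
        c->pos++;
    return c->pos - start;
}

/*
 * Try to parse "<name" at the cursor. Returns 1 on success. On failure the
 * read position is rewound to where it was on entry and 0 is returned, so
 * the caller can emit the '<' as ordinary text instead of dropping input.
 */
static int try_start_tag(cursor *c)
{
    size_t saved = c->pos;

    if (c->buf[c->pos] != '<')
        return 0;
    c->pos++;                 /* skip '<' */
    if (parse_name(c) == 0) {
        c->pos = saved;       /* restore the original read position */
        return 0;
    }
    return 1;
}
```

For the input "a < b" with the cursor on the '<', try_start_tag() returns 0 
and leaves the position unchanged, so nothing after the '<' is lost.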

Best regards,
Christian Schoenebeck
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
https://mail.gnome.org/mailman/listinfo/xml
