[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

Senthil Kumaran Thu, 23 Dec 2010 06:58:57 -0800

Senthil Kumaran <orsent...@gmail.com> added the comment:

Yes, I too agree that HTMLParser.unescape() should spit out malformed char-refs 
unchanged, just as browsers do.


But, as the unescape function has been undocumented/unexposed across releases, 
I am not sure exposing it is a good idea. HTMLParser is more for event-based 
parsing of tags, and unescape is just a helper function in that context.

Given that reasoning, if you look at the malformed-charref test, you will see 
that event-based parsing does return the malformed data properly, e.g. 
("data", "&#bad;").

Only calling unescape explicitly does not exhibit this behavior.
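
The two code paths contrasted above can be compared side by side with a small 
sketch. It makes some assumptions that are not part of this issue: it targets 
a current Python 3, where the parser lives in html.parser and the public 
html.unescape() function stands in for the HTMLParser.unescape() helper 
discussed here, and DataCollector and collect_data are hypothetical names 
used only for illustration.

```python
import html
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: records the text passed to handle_data()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # The parser may deliver text in more than one chunk.
        self.chunks.append(data)


def collect_data(markup):
    """Feed markup through the event-based parser and return its text."""
    parser = DataCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


# Event-based path: the malformed char-ref arrives as ordinary data.
event_text = collect_data("<p>&#bad;</p>")

# Explicit path: assumed modern equivalent of the unescape helper.
direct_text = html.unescape("&#bad;")

print(repr(event_text), repr(direct_text))
```

If both values print as the original "&#bad;" text, the explicit call matches 
the event-based behavior, which is the outcome the patch is aiming for.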

Martin: I am not sure that changing something in line 168 would solve the 
issue. In that particular block of code, the else branch is responsible for 
emitting the malformed charref as an event. If you would like to elaborate a 
bit more on your suggestion, it would be helpful.

However, I do agree that unescape can be changed as per your patch, and I have 
added a simple test to exercise that change. I think this can go in.

----------
assignee:  -> orsenthil
stage: unit test needed -> patch review
versions: +Python 2.7, Python 3.1, Python 3.2 -Python 2.6
Added file: http://bugs.python.org/file20147/Issue10759.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue10759>
_______________________________________
