[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread Martin Potthast

Changes by Martin Potthast :


--
title: HTMLParser.unescape() cannot handle HTML entities with incorrect syntax 
(e.g. &#hearts;) -> HTMLParser.unescape() fails on HTML entities with incorrect 
syntax (e.g. &#hearts;)

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread Martin Potthast

Martin Potthast  added the comment:

I'd suggest validating the input more carefully and returning such strings unchanged.

--
type:  -> behavior




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread R. David Murray

R. David Murray  added the comment:

Leaving the input unchanged does seem to be what browsers do.  (Issue 7626 has 
some info on browser behaviour with invalid entity refs.)

Rather than pre-validating the input, I think the exception can be caught and 
the putative entity returned unchanged, similar to how a KeyError is handled 
for a named entity.
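
The approach described above could be sketched roughly as follows. This is 
not the attached patch: the function name unescape_lenient and the regular 
expression are assumptions made for illustration, and the sketch targets 
current Python 3, using html.entities.name2codepoint for named entities.

```python
import re
from html.entities import name2codepoint

# Hypothetical pattern: "&", an optional "#", some word characters, then ";".
_ENTITY = re.compile(r"&(#?\w+);")


def unescape_lenient(text):
    """Replace well-formed entities; leave malformed ones untouched."""

    def replace(match):
        ref = match.group(1)
        try:
            if ref.startswith("#"):
                body = ref[1:]
                if body[:1] in ("x", "X"):
                    code = int(body[1:], 16)  # hexadecimal character reference
                else:
                    code = int(body)          # decimal character reference
                return chr(code)
            # Named entity: an unknown name raises KeyError.
            return chr(name2codepoint[ref])
        except (ValueError, KeyError, OverflowError):
            # Malformed or unknown reference: return the original text as is.
            return match.group(0)

    return _ENTITY.sub(replace, text)
```

With this sketch, a string such as "&#hearts;" raises ValueError inside 
int() and is therefore returned unchanged, while "&hearts;" and "&#9829;" 
are both converted.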

Would you have any interest in proposing a patch and tests?

--
nosy: +orsenthil, r.david.murray
stage:  -> needs patch




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread Martin Potthast

Martin Potthast  added the comment:

Agreed. Here's a patch for HTMLParser. That was easy enough.

With regard to tests, there already seems to be one called 
test_malformatted_charref in test_htmlparser.py. However, that test exercises 
the whole parser and not only HTMLParser.unescape().

At the same time, HTMLParser.unescape() has the following comment:
"# Internal -- helper to remove special character quoting"

It appears the syntax check is done in line 168 already, but since the unescape 
function is publicly visible, I'd say that it should be capable of handling all 
kinds of malformed input, despite that comment. Maybe this comment should be 
removed.

I'm not entirely sure how to write the test properly, since it doesn't fit into 
the framework provided by test_htmlparser.py; and unfortunately, my time is 
rather short at the moment.

--
keywords: +patch
Added file: http://bugs.python.org/file20141/HTMLParser.py.diff




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread R. David Murray

R. David Murray  added the comment:

Ah, as an undocumented internal interface it may in fact not be appropriate to 
make this change.  Or it may be.  I'll have to look at the code in more detail 
to figure that out, or perhaps Senthil will.  (It may even be time to document 
the function, we'll see.)

Thanks for the patch.

--
stage: needs patch -> unit test needed




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-22 Thread Martin Potthast

Martin Potthast  added the comment:

Why not simply remove the additional check in line 168 and leave the 
responsibility for checking the validity of its input to the unescape function 
(be it explicitly or, as now, lazily)? That way, the code changes are minimal, 
the existing test covers the current issue, and the function gets more robust.

By the way, I came across this function via Stackoverflow:
http://stackoverflow.com/questions/2087370

--




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-23 Thread Senthil Kumaran

Senthil Kumaran  added the comment:

Yes, I too agree that HTMLParser.unescape() should spit out malformed char-refs 
unchanged, just as browsers do.

But as the unescape function has been undocumented/unexposed for several 
releases, I am not sure exposing it is a good idea. HTMLParser is more for 
event-based parsing of tags, and unescape is just a helper function in that 
context.

Given that reasoning, if you look at the malformatted test, you will see that 
event-based parsing does return the malformatted data properly, e.g. ("data", 
"&#bad;").

Only an explicit call to unescape fails to exhibit this behavior.
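
The event-based behavior can be observed with a small recorder subclass. The 
sketch below is an illustration written against the current Python 3 
html.parser module, not the 2010 code under discussion; the names 
EventRecorder and record_events are hypothetical. Data may arrive in more 
than one chunk depending on parser settings, so the chunks are joined before 
being inspected.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records parser events as tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def record_events(markup):
    """Feed markup through the recorder and return the collected events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


events = record_events("<p>&#bad;</p>")
# Join the data chunks, since the parser may split text across several events.
text = "".join(value for kind, value in events if kind == "data")
print(events)
print(text)
```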

Martin: I am not sure if changing something in line 168 would solve the issue. 
In that particular block of code, the else condition is responsible for 
throwing the malformed charref on an event. If you would like to elaborate a bit 
more on your suggestion, that would be helpful.

However, I do agree that unescape can be changed as per your patch, and I have 
added a simple test to exercise that change. I think this can go in.
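
A test along these lines might look like the sketch below. It is not the test 
from the attached diff: it assumes Python 3.4 or later, where the public 
html.unescape() function is available, and the class and method names are 
made up for illustration.

```python
import html
import unittest


class MalformedCharrefTest(unittest.TestCase):
    """Hypothetical test case; not the one committed for this issue."""

    def test_malformed_charref_is_left_unchanged(self):
        # Neither string is a valid numeric reference, so both should survive.
        self.assertEqual(html.unescape("&#hearts;"), "&#hearts;")
        self.assertEqual(html.unescape("&#bad;"), "&#bad;")

    def test_valid_references_are_converted(self):
        self.assertEqual(html.unescape("&hearts;"), "\u2665")
        self.assertEqual(html.unescape("&#9829;"), "\u2665")


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MalformedCharrefTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```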

--
assignee:  -> orsenthil
stage: unit test needed -> patch review
versions: +Python 2.7, Python 3.1, Python 3.2 -Python 2.6
Added file: http://bugs.python.org/file20147/Issue10759.diff




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-28 Thread Senthil Kumaran

Senthil Kumaran  added the comment:

Fixed this in r87542 (py3k). unescape is an undocumented helper method, so no 
docs were added.

There was already an issue (Issue 6662) on malformed charref handling, and it is 
fixed.

--
resolution:  -> fixed
stage: patch review -> committed/rejected




[issue10759] HTMLParser.unescape() fails on HTML entities with incorrect syntax (e.g. &#hearts; )

2010-12-28 Thread Senthil Kumaran

Senthil Kumaran  added the comment:

r87544 (release27-maint) and r87545 (release31-maint).

--
status: open -> closed
