[issue26009] HTMLParser lacking a few features to reconstruct input exactly

Jason Sachs Mon, 04 Jan 2016 09:38:44 -0800

New submission from Jason Sachs:

The HTMLParser class (https://docs.python.org/2/library/htmlparser.html) is 
lacking a few features to reconstruct input exactly. For the most part it can 
do this, but I found two items where it falls short (there may be others):


- There is a get_starttag_text() method but no get_endtag_text() method. The 
latter is needed when the end tag is not in canonical form, e.g. instead of 
</p> it is </P> or </   P >

- The effect of the parse_bogus_comment() internal method is to call 
handle_comment(), so content like <! I AM BOGUS > cannot be distinguished by 
subclasses of HTMLParser from actual comments <!-- I AM BOGUS -->
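
Both gaps can be seen with a small recorder subclass. This is only a sketch: 
it assumes Python 3, where the module is named html.parser (this report is 
against the 2.7 HTMLParser module), and EventRecorder and record() are 
hypothetical names made up for the illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the 2.7 HTMLParser module


class EventRecorder(HTMLParser):
    """Hypothetical helper that records the callbacks the parser fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_endtag(self, tag):
        # Only the lowercased tag name arrives here, not the raw end tag text.
        self.events.append(("endtag", tag))

    def handle_comment(self, data):
        # Called for real comments and for bogus ones alike.
        self.events.append(("comment", data))


def record(html):
    """Parse html and return the list of recorded events."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


# Different inputs, same events: the original spelling is lost.
print(record("</p>") == record("</P>"))
print(record("<! I AM BOGUS >") == record("<!-- I AM BOGUS -->"))
```

In both comparisons the subclass receives identical callbacks for different 
input text, so it has nothing from which to rebuild the original.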

Suggested changes:

- Add a get_endtag_text() method to return the exact end tag text.
- Change parse_bogus_comment() to call self.handle_bogus_comment(), and define 
self.handle_bogus_comment() to call self.handle_comment(). This way it is 
backwards-compatible with existing behavior, but subclasses can redefine 
self.handle_bogus_comment() to do what they want.
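
A rough sketch of how the two changes might look, written as a subclass 
rather than a patch. It assumes Python 3's html.parser and leans on the 
internal methods parse_endtag() and parse_bogus_comment(), whose names and 
signatures are implementation details that may differ between versions. 
ExactHTMLParser and Collector are hypothetical names, and taking the end tag 
text up to the first '>' is a simplifying assumption, not necessarily how a 
real implementation would find it.

```python
from html.parser import HTMLParser


class ExactHTMLParser(HTMLParser):
    """Hypothetical subclass sketching the two proposed additions."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._endtag_text = None

    def get_endtag_text(self):
        """Return the raw text of the most recent end tag, or None."""
        return self._endtag_text

    def parse_endtag(self, i):
        # Internal hook (assumption): remember the raw text up to the first
        # '>' before the base class gets a chance to call handle_endtag().
        end = self.rawdata.find(">", i)
        if end != -1:
            self._endtag_text = self.rawdata[i:end + 1]
        return super().parse_endtag(i)

    def parse_bogus_comment(self, i, report=1):
        # Same scan for the closing '>' as the base class, but the content
        # is routed to the new hook instead of straight to handle_comment().
        rawdata = self.rawdata
        pos = rawdata.find(">", i + 2)
        if pos == -1:
            return -1
        if report:
            self.handle_bogus_comment(rawdata[i + 2:pos])
        return pos + 1

    def handle_bogus_comment(self, data):
        # Default preserves the existing behavior.
        self.handle_comment(data)


class Collector(ExactHTMLParser):
    """Hypothetical consumer that tells the two comment kinds apart."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_endtag(self, tag):
        self.events.append(("endtag", self.get_endtag_text()))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_bogus_comment(self, data):
        self.events.append(("bogus", data))


collector = Collector()
collector.feed("<p>text</P><! I AM BOGUS ><!-- I AM BOGUS -->")
collector.close()
print(collector.events)
```

A subclass that overrides only handle_comment() keeps receiving bogus 
comments through it, which is the backwards-compatible part of the proposal.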

----------
messages: 257472
nosy: jason_s
priority: normal
severity: normal
status: open
title: HTMLParser lacking a few features to reconstruct input exactly
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue26009>
_______________________________________