[issue37071] HTMLParser mistakenly inventing new tags while parsing

Cheryl Sabella Fri, 21 Jun 2019 06:15:18 -0700


Cheryl Sabella <cheryl.sabe...@gmail.com> added the comment:


Thank you for the report.

Looking at the BeautifulSoup source, there is a comment about this scenario:
            # Unlike other parsers, html.parser doesn't send separate end tag
            # events for empty-element tags. (It's handled in
            # handle_startendtag, but only if the original markup looked like
            # <tag/>.)
            #
            # So we need to call handle_endtag() ourselves. Since we
            # know the start event is identical to the end event, we
            # don't want handle_endtag() to cross off any previous end
            # events for tags of this name.


HTMLParser itself never emits an end event for the unclosed <test> tag:
>>> from html.parser import HTMLParser
>>> class MyParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print(f'start: {tag}')
...     def handle_endtag(self, tag):
...         print(f'end: {tag}')
...     def handle_data(self, data):
...         print(f'data: {data}')
...
>>> parser = MyParser()
>>> parser.feed('<p><test></p>')
start: p
start: test
end: p
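
For comparison, here is a minimal sketch of the case the BeautifulSoup comment
describes, where the markup is written as <test/>. It assumes only the standard
library; EventRecorder and record() are hypothetical names used for
illustration and are not part of html.parser or BeautifulSoup. The default
handle_startendtag() calls handle_starttag() followed by handle_endtag(), so
the self-closing form should yield a matching end event, while the bare <test>
form does not.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records callbacks as (kind, tag) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def record(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# Bare <test>: only a start event is reported for it.
print(record("<p><test></p>"))
# [('start', 'p'), ('start', 'test'), ('end', 'p')]

# Self-closing <test/>: handle_startendtag() supplies the end event as well.
print(record("<p><test/></p>"))
# [('start', 'p'), ('start', 'test'), ('end', 'test'), ('end', 'p')]
```

A caller that needs an end event for every start tag therefore has to
synthesize it, which is what the BeautifulSoup comment above says it does.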

My suggestion would be to try a different parser in BeautifulSoup [1] to handle
this.  Even if we wanted to modify HTMLParser, any change to the events it
emits would probably be backwards incompatible.

[1] https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

----------
nosy: +cheryl.sabella
resolution:  -> third party
stage:  -> resolved
status: open -> closed

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37071>
_______________________________________