New submission from Tom Anderl:

When the HTMLParser encounters a start tag element that includes:

1. an unquoted attribute as the final attribute
2. an optional '/' character marking the start tag as self-closing
3. no space between the final attribute and the '/' character

the '/' character gets attached to the attribute value and the element is
interpreted as not self-closing. This can be illustrated with the following:

===============================================================================
import HTMLParser

# Begin Monkeypatch
#import re
#HTMLParser.attrfind = re.compile(
#    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
#    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^/>\s]*))?(?:\s|/(?!>))*')
# End Monkeypatch

class MyHTMLParser(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('got starttag: {0} with attributes {1}'.format(tag, attrs))

    def handle_endtag(self, tag):
        print('got endtag: {0}'.format(tag))

MyHTMLParser().feed('<img height=1.0 width=2.0/>')
==============================================================================

Running the above code yields the output:

got starttag: img with attributes [('height', '1.0'), ('width', '2.0/')]

Note the trailing '/' on the 'width' attribute. If I uncomment the monkey
patch, the script then yields:

got starttag: img with attributes [('height', '1.0'), ('width', '2.0')]
got endtag: img

Note that the trailing '/' is gone, and an endtag event was generated.

----------
components: Library (Lib)
messages: 258013
nosy: Tom Anderl
priority: normal
severity: normal
status: open
title: HTMLParser mishandles last attribute in self-closing tag
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26084>
_______________________________________
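
The effect of the monkeypatch in the report can also be looked at in isolation, using only the re module and no parser at all. The following is a rough sketch, not the parser's actual code path: PATCHED is the pattern copied from the monkeypatch, while UNPATCHED is an assumption about the stock pattern (taken here to differ only in letting '/' appear in an unquoted value). The helper scan_attrs and the constants TAG and START are hypothetical names introduced for illustration.

```python
import re

# Pattern copied from the monkeypatch in the report: the unquoted-value
# class [^/>\s]* stops at '/'.
PATCHED = re.compile(
    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^/>\s]*))?(?:\s|/(?!>))*')

# ASSUMPTION: the unpatched pattern is identical except that the
# unquoted-value class [^>\s]* also accepts '/'.
UNPATCHED = re.compile(
    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^>\s]*))?(?:\s|/(?!>))*')


def scan_attrs(pattern, tag_text, start):
    """Hypothetical helper: match `pattern` repeatedly from `start`.

    Returns the (name, value) pairs found and whatever text is left over
    once the pattern stops matching.
    """
    attrs = []
    pos = start
    while True:
        # Passing pos (instead of slicing) lets the lookbehind see the
        # character just before the attribute name.
        m = pattern.match(tag_text, pos)
        if not m:
            break
        attrs.append((m.group(1), m.group(3)))
        pos = m.end()
    return attrs, tag_text[pos:]


TAG = '<img height=1.0 width=2.0/>'
START = len('<img ')  # index of the first attribute name

print(scan_attrs(UNPATCHED, TAG, START))
print(scan_attrs(PATCHED, TAG, START))
```

With the patched pattern the leftover text is '/>' and the 'width' value is '2.0'; with the assumed unpatched pattern the '/' is absorbed into the value and only '>' is left over. That is consistent with the report, where only the patched run produces an endtag event.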