New submission from Anders Hammarquist:

Python 2.7 HTMLParser.py lines 185-199 (similar lines still exist in Python 3.4):

    match = charref.match(rawdata, i)
    if match:
        ...
    else:
        if ";" in rawdata[i:]: #bail by consuming &#
            self.handle_data(rawdata[0:2])
            i = self.updatepos(i, 2)
        break

If you feed the parser a broken (non-numeric) charref, it passes whatever happens to be at the start of rawdata to handle_data(), because the slice rawdata[0:2] ignores the current offset i. E.g.:

    import sys
    from HTMLParser import HTMLParser

    p = HTMLParser()
    p.handle_data = lambda x: sys.stdout.write(x)
    p.feed('<p>&#foo;</p>')

will print '<p', which is clearly wrong. I think the intention of the code is to pass '&#', which seems saner.

----------
components: Library (Lib)
messages: 208336
nosy: iko
priority: normal
severity: normal
status: open
title: HTMLParse handing of non-numeric charrefs broken
type: behavior

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20288>
_______________________________________
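
The difference between the reported slice and the presumably intended one can be shown in isolation. The following is only a minimal sketch, not the library's code: bail_out_data() is a hypothetical helper written for this illustration, and CHARREF is a pattern modelled on the parser's charref regular expression. It assumes the intended behaviour is to slice at the current offset i, i.e. rawdata[i:i+2].

```python
import re

# Numeric character reference pattern, modelled on the parser's `charref`
# (an assumption for this sketch, not imported from the library).
CHARREF = re.compile(r'&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')


def bail_out_data(rawdata, i, fixed=True):
    """Return the text the bail-out branch would pass to handle_data().

    Hypothetical helper: `i` is the offset of a '&#' in `rawdata`.
    Returns None when the bail-out branch would not run, i.e. when the
    charref is valid or when no ';' follows.
    """
    if not rawdata.startswith("&#", i):
        raise ValueError("offset i must point at '&#'")
    if CHARREF.match(rawdata, i):
        return None  # valid numeric charref, handled elsewhere
    if ";" not in rawdata[i:]:
        return None  # incomplete input, the parser waits for more data
    if fixed:
        return rawdata[i:i + 2]  # assumed intent: consume the '&#' at offset i
    return rawdata[0:2]  # reported behaviour: always slices from offset 0


raw = '<p>&#foo;</p>'
offset = raw.index('&#')
print(repr(bail_out_data(raw, offset, fixed=False)))
print(repr(bail_out_data(raw, offset, fixed=True)))
```

With the input from the report, the fixed=False call returns the first two characters of the buffer, while the fixed=True call returns the two characters at the offset of the broken charref.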