Ezio Melotti <ezio.melo...@gmail.com> added the comment:

I left a review of your patch on Rietveld, including a description of what I think is going on there (the patch lacks some context, and it's not easy to figure out how everything works). I also ran some tests with and without the patch:

>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
...     def handle_data(self, data): print 'data: %r' % data
...
>>> myhp = MyHP()

# without the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # this looks ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo'  # where's the </p>?
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo'  # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247

# with the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo'  # all the content is there, but why 2 chunks?
data: '</p>'
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo'  # same as previous
data: '</p>'
data: '<span>bar'
data: '</span>'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'  # same
data: '</p>'
data: " '"
data: "</scr'+'ipt>"
data: "' <span>bar"
data: '</span>'

So my question is: is it normal that the data is passed to handle_data in chunks? AFAIU HTML parsers should treat CDATA as a single chunk of bytes they don't care about, so the fact that further parsing happens on the content of script/style seems wrong to me.

If I'm reading the code correctly, that's because the "interesting" regex is set to look for a closing tag ('</'), maybe on the assumption that the CDATA section doesn't contain any other tag (usually true for <style>, often false for <script>). Changing the regex to look explicitly for the matching closing tag might be better, but it would still fail for e.g. <script> document.write('<script>alert("foo")</script>')</script> (some browsers will fail on this too).

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue670664>
_______________________________________
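
A minimal sketch of the two stop conditions discussed above, to make the difference concrete. It is not taken from HTMLParser.py: the names ANY_CLOSE, SCRIPT_CLOSE and script_data are made up for illustration, the patterns only approximate what the parser does, and the sketch is written for Python 3 using nothing but the re module.

```python
import re

# Assumption: these patterns approximate the two behaviours discussed in the
# comment; they are NOT copied from HTMLParser.py.
ANY_CLOSE = re.compile(r'</')                     # stop at the start of any closing tag
SCRIPT_CLOSE = re.compile(r'</script\s*>', re.I)  # stop only at an explicit </script>


def script_data(text, stop):
    """Return the part of `text` that precedes the first match of `stop`.

    `text` is everything that follows the opening <script> tag.
    If `stop` never matches, the whole text is returned.
    """
    match = stop.search(text)
    return text if match is None else text[:match.start()]


body = "<p>foo</p><span>bar</span></script>"
print(repr(script_data(body, ANY_CLOSE)))     # cut short at the first '</'
print(repr(script_data(body, SCRIPT_CLOSE)))  # full script content

# A string that merely resembles a closing tag no longer ends the data.
tricky = "<p>foo</p> '</scr'+'ipt>' <span>bar</span></script>"
print(repr(script_data(tricky, SCRIPT_CLOSE)))

# The remaining failure case: a literal </script> inside the script itself.
nested = "document.write('<script>alert(\"foo\")</script>')</script>"
print(repr(script_data(nested, SCRIPT_CLOSE)))
```

With the explicit pattern the first two inputs come back as a single piece, while the last one is still cut at the inner </script>, which is the limitation mentioned above.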