[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-02-14 Thread R. David Murray


Changes by R. David Murray :


--
nosy: +r.david.murray

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-03-08 Thread Alexander


Alexander  added the comment:

This is a small patch for issue9577, a similar bug that is actually not 
related to this one.

--
nosy: +friday
Added file: http://bugs.python.org/file21045/cdata_patch.diff


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-03-08 Thread Alexander


Alexander  added the comment:

And this patch fixes both bugs in a more elegant way.

--
Added file: http://bugs.python.org/file21046/cdata_patch.diff


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-03-12 Thread Ezio Melotti


Ezio Melotti  added the comment:

Thanks for the patch; however, it would be better if you could get a clone of 
the CPython repo and make a patch against it.
The patch should also include tests.

You can check http://docs.python.org/devguide/ for more information.

--


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-09-30 Thread Yotam Medini


Yotam Medini  added the comment:

HTMLParser.py fails inside 
   <script> ... </script>
where it can be fooled by JavaScript with less-than '<' conditional expressions.
In the attached example:

 $ tar tvzf lt-in-script-example.tgz | cut -c24-
 796 2010-09-30 16:52 h2t.py
   23678 2010-09-30 16:39 t.html

here's what happens:

 $ python h2t.py t.html /tmp/t.txt
 HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
 Traceback (most recent call last):
   File "h2t.py", line 31, in <module>
     text = html2text(f_html.read())
   File "h2t.py", line 23, in html2text
     te = TextExtractor(html)
   File "h2t.py", line 15, in __init__
     self.feed(html)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
     self.goahead(0)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
     k = self.parse_starttag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
     endpos = self.check_for_whole_start_tag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
     self.error("malformed start tag")
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
     raise HTMLParseError(message, self.getpos())
 HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332


I have a suggested patch 
   HTMLParser.diff
fixing this problem, soon to be attached.
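
For readers without the attachment, here is a minimal sketch of a text extractor in the spirit of h2t.py. It is not the attached file: it assumes Python 3's html.parser, which treats script and style bodies as CDATA, and the names SKIPPED, skip_depth and the sample markup are made up for illustration. Only TextExtractor and html2text are names taken from the traceback above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of script/style elements."""

    SKIPPED = ("script", "style")  # hypothetical name, not from h2t.py

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html2text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the collected pieces and normalize runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


# Made-up sample: a '<' comparison inside a script element.
sample = (
    "<p>before</p>"
    "<script>if (a < b) { i = 1; }</script>"
    "<p>after</p>"
)
print(html2text(sample))
```

With a parser that handles script content as CDATA, the '<' in the comparison should not be mistaken for the start of a tag, so the extractor only sees the two paragraphs.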

-- yotam

--
nosy: +yotam
Added file: http://bugs.python.org/file19072/lt-in-script-example.tgz


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-09-30 Thread Yotam Medini


Yotam Medini  added the comment:

The attached suggested patch fixes the problems shown in msg117762.

--
Added file: http://bugs.python.org/file19073/HTMLParser.diff


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-11-02 Thread Éric Araujo


Éric Araujo  added the comment:

Would it be reasonable to add knowledge to html.parser to make it recognize 
script elements as CDATA and handle them correctly (that is, let “<” pass)?
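
To make the question concrete, the following sketch shows what "handling script as CDATA" looks like from the caller's side. It assumes a current Python 3 html.parser, where this behaviour is already present. The Recorder class and the sample script body are hypothetical and not taken from any patch on this issue.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Record start tags and the text seen inside script elements."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.script_chunks = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chunks.append(data)


# Made-up script body containing both '<' and markup-like text.
script_body = 'if (a < b) { document.write("<b>hi</b>"); }'
recorder = Recorder()
recorder.feed("<p>x</p><script>" + script_body + "</script>")
recorder.close()

# Join the chunks so the check does not depend on how the data was split.
script_text = "".join(recorder.script_chunks)
print(recorder.start_tags)
print(script_text)
```

If the element is treated as CDATA, the "<b>" inside the script should reach handle_data as plain text and never show up as a start tag.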

--
nosy: +eric.araujo


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-01-02 Thread Yotam Medini


Changes by Yotam Medini :


Added file: http://bugs.python.org/file20231/endtag-space.html


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-01-02 Thread Yotam Medini


Changes by Yotam Medini :


Added file: http://bugs.python.org/file20232/dollar-extra.html


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-01-02 Thread Yotam Medini


Yotam Medini  added the comment:

Suggested fix for the attached cases:
  lt-in-script-example.tgz
  endtag-space.html
  dollar-extra.html

--
Added file: http://bugs.python.org/file20233/ltscr-endtag-dollarext.diff


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-08-13 Thread R. David Murray


Changes by R. David Murray :


--
nosy: +Hunanyan


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-08-17 Thread Mark Lawrence


Changes by Mark Lawrence :


--
versions: +Python 3.1


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2008-12-01 Thread Chris Palmer


Chris Palmer <[EMAIL PROTECTED]> added the comment:

Here is an additional test case. I have a super simple HTML "minifier"
that burps when given this test file:


$ cat test.html 
'foo '


The explosion is:


$ ./minify.py test.html 
Warning: malformed start tag
'foo Traceback (most recent call last):
  File "./minify.py", line 84, in <module>
    m.feed(f.read())
  File "/usr/local/lib/python2.5/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.5/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/local/lib/python2.5/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/local/lib/python2.5/HTMLParser.py", line 302, in check_for_whole_start_tag
    raise AssertionError("we should not get here!")
AssertionError: we should not get here!


--
nosy: +cpalmer
versions: +Python 2.5 -Python 2.3
Added file: http://bugs.python.org/file12183/minify.py


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2008-12-01 Thread Chris Palmer


Changes by Chris Palmer <[EMAIL PROTECTED]>:


--
type:  -> behavior


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2008-07-20 Thread Georg Brandl


Georg Brandl <[EMAIL PROTECTED]> added the comment:

Adding test suite patch from #674449; closed that as a duplicate.

--
nosy: +georg.brandl
Added file: http://bugs.python.org/file10952/patch-test-cdata.txt


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-03-06 Thread Gabriel Sean Farrell


Gabriel Sean Farrell  added the comment:

Now that BeautifulSoup uses HTMLParser, more people are seeing these
errors. See
http://groups.google.com/group/beautifulsoup/msg/d5a7540620538d14 and
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=516824

--
nosy: +gsf


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-06-03 Thread Paweł Widera


Changes by Paweł Widera :


--
nosy: +momat


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-06-03 Thread Paweł Widera


Paweł Widera  added the comment:

A simple workaround for BeautifulSoup is the following wrapper. It
sanitizes the JavaScript code before passing it to the parser by joining
the disjoint strings, so that "" becomes "".

import copy
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

def bs(input):
    # Drop the "+" seam between adjacent string literals before parsing.
    pattern = re.compile(r'"\+"')
    match = lambda x: ""
    massage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    massage.extend([(pattern, match)])
    return BeautifulSoup(input, markupMassage=massage)
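
To see what the massage pattern does in isolation, here is a small sketch that uses only the re module. The input string is a made-up example and is not taken from this thread.

```python
import re

# Same pattern as in the wrapper above: a closing quote, a plus sign and an
# opening quote, i.e. the seam between two concatenated string literals.
pattern = re.compile(r'"\+"')

# Hypothetical input: a tag name split across two JavaScript string literals.
javascript = 'document.write("<scr"+"ipt>");'
joined = pattern.sub("", javascript)
print(joined)
```

Note that the pattern only matches double-quoted literals joined with no whitespace around the plus sign, so single-quoted strings or a spaced-out '" + "' would pass through unchanged.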

--


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-06-06 Thread Ezio Melotti


Changes by Ezio Melotti :


--
nosy: +ezio.melotti
versions: +Python 2.7, Python 3.2 -Python 2.5


[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-07-26 Thread Matt Basta


Matt Basta  added the comment:

The number of problems produced by this bug can be greatly reduced by adding a 
relatively small check to the parser. Currently,

[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-07-26 Thread Ezio Melotti


Ezio Melotti  added the comment:

I left a review of your patch on rietveld, including a description of what I 
think is going on there (the patch lacks some context and it's not easy to 
figure out how everything works there).
I also did some tests with and without the patch:

>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
...   def handle_data(self, data): print 'data: %r' % data
... 
>>> myhp = MyHP()

# without the patch:
>>> myhp.feed('foobar')
data: 'foobar'  # this looks ok
>>> myhp.feed('foo')
data: 'foo'  # where's the ?
>>> myhp.feed('foo
bar')
data: 'foo' # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("foo
 '' bar")
data: 'foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "", at line 1, column 247


# with the patch:
>>> myhp.feed('foobar')
data: 'foobar'  # ok
>>> myhp.feed('foo')
data: 'foo' # all the content is there, but why 2 chunks?
data: ''
>>> myhp.feed('foo
bar')
data: 'foo' # same as previous
data: ''
data: 'bar'
data: ''
>>> myhp.feed("foo
 '' bar")  
data: 'foo' # same
data: ''
data: " '"
data: ""
data: "' bar"
data: ''

So my question is: is it normal that the data is passed to handle_data in 
chunks?
AFAIU HTML parsers should see CDATA as a single chunk of bytes they don't care 
about, so the fact that further parsing happens on the content of script/style 
seems wrong to me.
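
Independently of how the parser splits the data internally, a subclass can present script/style content as a single chunk by buffering it until the end tag arrives. The following is only a sketch of that idea on top of Python 3's html.parser, not part of any patch here. CoalescingParser, RAW_TEXT, raw_tag and raw_blocks are hypothetical names, and the sample markup is made up.

```python
from html.parser import HTMLParser


class CoalescingParser(HTMLParser):
    """Buffer script/style text and report it once per element."""

    RAW_TEXT = ("script", "style")

    def __init__(self):
        super().__init__()
        self.raw_tag = None   # name of the open script/style element, if any
        self.buffer = []      # data chunks seen since that element opened
        self.raw_blocks = []  # one (tag, text) pair per script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.RAW_TEXT:
            self.raw_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.raw_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.raw_tag:
            # Emit everything collected for this element as one string.
            self.raw_blocks.append((tag, "".join(self.buffer)))
            self.raw_tag = None


parser = CoalescingParser()
parser.feed("<style>p > b { color: red }</style>")
parser.feed("<script>var s = 'x'; if (1 < 2) { s = '<i>y</i>'; }</script>")
parser.close()
print(parser.raw_blocks)
```

This keeps callers insulated from the chunking question: however many times handle_data fires inside the element, raw_blocks gets exactly one entry per script or style element.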
If I'm reading the code correctly that's because the "interesting" regex is set 
to look for a closing tag (', often false for 
).
Changing the regex to explicitly look for the closing tag might be better (but 
still fail for e.g.