[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-02-14 Thread R. David Murray

Changes by R. David Murray :


--
nosy: +r.david.murray

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-03-08 Thread Alexander

Alexander  added the comment:

This is a small patch for the related bug issue9577, which is actually not
related to this bug.

--
nosy: +friday
Added file: http://bugs.python.org/file21045/cdata_patch.diff




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-03-08 Thread Alexander

Alexander  added the comment:

And this patch fixes both bugs in a more elegant way.

--
Added file: http://bugs.python.org/file21046/cdata_patch.diff




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-03-12 Thread Ezio Melotti

Ezio Melotti  added the comment:

Thanks for the patch; however, it would be better if you could get a clone of
the CPython repo and make a patch against it.
The patch should also include tests.

You can check http://docs.python.org/devguide/ for more information.

--




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-09-30 Thread Yotam Medini

Yotam Medini  added the comment:

HTMLParser.py fails inside
   <script> ... </script>
where it can be fooled by JavaScript with less-than '<' conditional expressions.
In the attached example:

 $ tar tvzf lt-in-script-example.tgz | cut -c24-
 796 2010-09-30 16:52 h2t.py
   23678 2010-09-30 16:39 t.html

here's what happens:

 $ python h2t.py t.html /tmp/t.txt
 HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
 Traceback (most recent call last):
   File "h2t.py", line 31, in <module>
     text = html2text(f_html.read())
   File "h2t.py", line 23, in html2text
     te = TextExtractor(html)
   File "h2t.py", line 15, in __init__
     self.feed(html)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
     self.goahead(0)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
     k = self.parse_starttag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
     endpos = self.check_for_whole_start_tag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
     self.error("malformed start tag")
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
     raise HTMLParseError(message, self.getpos())
 HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332


I have a suggested patch 
   HTMLParser.diff
fixing this problem, soon to be attached.

-- yotam
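
For comparison, here is a minimal sketch of the same kind of input fed to the Python 3 `html.parser` module. It assumes a current Python 3 interpreter rather than the Python 2 `HTMLParser.py` discussed in this report; the `DataCollector` class, `script_text` helper and `SAMPLE` string are hypothetical names made up for illustration and are not taken from the attached `h2t.py`.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Record every start tag, end tag and data chunk the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def script_text(html):
    """Return all text data found in html, joined into one string."""
    parser = DataCollector()
    parser.feed(html)
    parser.close()
    return "".join(text for kind, text in parser.events if kind == "data")


# Hypothetical input: a '<' comparison inside a script element.
SAMPLE = "<script>if (a < b) { run(); }</script>"

if __name__ == "__main__":
    print(script_text(SAMPLE))
```

The data chunks are joined before use because the number of `handle_data` calls is not something a caller should rely on.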

--
nosy: +yotam
Added file: http://bugs.python.org/file19072/lt-in-script-example.tgz




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-09-30 Thread Yotam Medini

Yotam Medini  added the comment:

The attached suggested patch fixes the problems shown in msg117762.

--
Added file: http://bugs.python.org/file19073/HTMLParser.diff




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-11-02 Thread Éric Araujo

Éric Araujo  added the comment:

Would it be reasonable to add knowledge to html.parser to make it recognize 
script elements as CDATA and handle it correctly (that is let “<” pass)?
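
A rough sketch of that idea, not the actual html.parser code: once a start tag for a CDATA-like element has been seen, the scanner stops treating "<" as markup and only looks for the matching end tag. The names `CDATA_ELEMENTS`, `START_TAG`, `END_TAG` and `tokenize` are hypothetical, and the tag regexes are deliberately simplistic.

```python
import re

# Elements whose content is treated as raw text (an assumption for this sketch).
CDATA_ELEMENTS = {"script", "style"}
START_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")
END_TAG = re.compile(r"</([a-zA-Z][a-zA-Z0-9]*)\s*>")


def tokenize(html):
    """Return a list of ('start', tag), ('end', tag) and ('data', text)."""
    tokens = []
    pos = 0
    while pos < len(html):
        start = START_TAG.match(html, pos)
        end = END_TAG.match(html, pos)
        if start:
            tag = start.group(1).lower()
            tokens.append(("start", tag))
            pos = start.end()
            if tag in CDATA_ELEMENTS:
                # CDATA mode: only the matching end tag is interesting,
                # so any '<' before it passes through as plain data.
                close = re.compile(r"</%s\s*>" % tag, re.I).search(html, pos)
                stop = close.start() if close else len(html)
                if stop > pos:
                    tokens.append(("data", html[pos:stop]))
                pos = stop
        elif end:
            tokens.append(("end", end.group(1).lower()))
            pos = end.end()
        else:
            # Plain text up to the next '<'; a stray '<' is kept as text.
            nxt = html.find("<", pos + 1)
            stop = nxt if nxt >= 0 else len(html)
            tokens.append(("data", html[pos:stop]))
            pos = stop
    return tokens


if __name__ == "__main__":
    for token in tokenize("<p>hi</p><script>if (a < b) { go(); }</script>"):
        print(token)
```

The script body comes out as a single data token because the scanner never tries to parse anything between the start tag and the matching end tag.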

--
nosy: +eric.araujo




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-01-02 Thread Yotam Medini

Changes by Yotam Medini :


Added file: http://bugs.python.org/file20231/endtag-space.html




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-01-02 Thread Yotam Medini

Changes by Yotam Medini :


Added file: http://bugs.python.org/file20232/dollar-extra.html




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-01-02 Thread Yotam Medini

Yotam Medini  added the comment:

Suggested fix for the attached cases:
  lt-in-script-example.tgz
  endtag-space.html
  dollar-extra.html

--
Added file: http://bugs.python.org/file20233/ltscr-endtag-dollarext.diff




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-08-13 Thread R. David Murray

Changes by R. David Murray :


--
nosy: +Hunanyan




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2010-08-17 Thread Mark Lawrence

Changes by Mark Lawrence :


--
versions: +Python 3.1




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2008-12-01 Thread Chris Palmer

Chris Palmer <[EMAIL PROTECTED]> added the comment:

Here is an additional test case. I have a super simple HTML "minifier"
that burps when given this test file:


$ cat test.html 
'foo '


The explosion is:


$ ./minify.py test.html 
Warning: malformed start tag
'foo Traceback (most recent call last):
  File "./minify.py", line 84, in <module>
    m.feed(f.read())
  File "/usr/local/lib/python2.5/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.5/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/local/lib/python2.5/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/local/lib/python2.5/HTMLParser.py", line 302, in check_for_whole_start_tag
    raise AssertionError("we should not get here!")
AssertionError: we should not get here!


--
nosy: +cpalmer
versions: +Python 2.5 -Python 2.3
Added file: http://bugs.python.org/file12183/minify.py




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2008-12-01 Thread Chris Palmer

Changes by Chris Palmer <[EMAIL PROTECTED]>:


--
type:  -> behavior




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2008-07-20 Thread Georg Brandl

Georg Brandl <[EMAIL PROTECTED]> added the comment:

Adding test suite patch from #674449; closed that as a duplicate.

--
nosy: +georg.brandl
Added file: http://bugs.python.org/file10952/patch-test-cdata.txt




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-03-06 Thread Gabriel Sean Farrell

Gabriel Sean Farrell  added the comment:

Now that BeautifulSoup uses HTMLParser, more people are seeing these
errors. See
http://groups.google.com/group/beautifulsoup/msg/d5a7540620538d14 and
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=516824

--
nosy: +gsf




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-06-03 Thread Paweł Widera

Changes by Paweł Widera :


--
nosy: +momat




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-06-03 Thread Paweł Widera

Paweł Widera  added the comment:

A simple workaround for BeautifulSoup is the following wrapper. It
sanitizes the JavaScript code before passing it to the parser by joining
the disjoint strings, so that "" becomes "".

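import copy
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

def bs(input):
    # Drop every "+" sequence so that adjacent string literals are joined.
    pattern = re.compile(r'"\+"')
    match = lambda x: ""
    massage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    massage.extend([(pattern, match)])
    return BeautifulSoup(input, markupMassage=massage)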
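
To see what the massage pattern does on its own, here is a small sketch that uses only the `re` module. The helper name `join_split_strings` and the sample JavaScript snippets are made up for illustration; they are not taken from the pages that triggered the bug.

```python
import re

# Same pattern as in the wrapper above: a closing quote, a plus sign and an
# opening quote with nothing in between.
JOIN_ADJACENT_STRINGS = re.compile(r'"\+"')


def join_split_strings(source):
    """Remove every "+" sequence so adjacent string literals are merged."""
    return JOIN_ADJACENT_STRINGS.sub("", source)


if __name__ == "__main__":
    print(join_split_strings('var s = "foo"+"bar";'))
```

Note that the pattern only matches when there is no whitespace around the plus sign, so a concatenation written as `"foo" + "bar"` is left untouched.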

--




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2009-06-06 Thread Ezio Melotti

Changes by Ezio Melotti :


--
nosy: +ezio.melotti
versions: +Python 2.7, Python 3.2 -Python 2.5




[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-07-26 Thread Matt Basta

Matt Basta  added the comment:

The number of problems produced by this bug can be greatly reduced by adding a 
relatively small check to the parser. Currently, 

[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

2011-07-26 Thread Ezio Melotti

Ezio Melotti  added the comment:

I left a review of your patch on rietveld, including a description of what I
think is going on there (the patch lacks some context and it's not easy to
figure out how everything works there).
I also did some tests with and without the patch:

>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
...   def handle_data(self, data): print 'data: %r' % data
... 
>>> myhp = MyHP()

# without the patch:
>>> myhp.feed('foobar')
data: 'foobar'  # this looks ok
>>> myhp.feed(' foo ')
data: 'foo'  # where's the ?
>>> myhp.feed(' foo bar')
data: 'foo'  # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed(" foo '' bar")
data: 'foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "", at line 1, column 247

# with the patch:
>>> myhp.feed('foobar')
data: 'foobar'  # ok
>>> myhp.feed(' foo ')
data: 'foo'  # all the content is there, but why 2 chunks?
data: ''
>>> myhp.feed(' foo bar')
data: 'foo'  # same as previous
data: ''
data: 'bar'
data: ''
>>> myhp.feed(" foo '' bar")
data: 'foo'  # same
data: ''
data: " '"
data: ""
data: "' bar"
data: ''

So my question is: is it normal that the data is passed to handle_data in
chunks? AFAIU HTML parsers should see CDATA as a single chunk of bytes they
don't care about, so the fact that further parsing happens on the content of
script/style seems wrong to me. If I'm reading the code correctly that's
because the "interesting" regex is set to look for a closing tag (', often
false for ). Changing the regex to explicitly look for the closing tag might
be better (but still fail for e.g.