Roundup Robot devn...@psf.upfronthosting.co.za added the comment:
New changeset 0a5eb57d5876 by Ezio Melotti in branch '2.7':
#670664: Fix HTMLParser to correctly handle the content of
``<script>...</script>`` and ``<style>...</style>``.
http://hg.python.org/cpython/rev/0a5eb57d5876
New changeset
Ezio Melotti ezio.melo...@gmail.com added the comment:
Fixed, thanks to everyone who contributed to this over the years!
--
resolution: -> fixed
stage: commit review -> committed/rejected
status: open -> closed
___
Python tracker rep...@bugs.python.org
Éric Araujo mer...@netwok.org added the comment:
-def set_cdata_mode(self):
+def set_cdata_mode(self, elem):
Looks like an incompatible behavior change. Is it only an internal method that
will never affect users’ code (even subclasses)?
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
I think it's internal. While that's not explicitly stated in the source, the
method is undocumented and I don't think anyone has overridden it in a
subclass. All it does is change the regex used to parse the data, and if
someone needs to
change
Ezio Melotti ezio.melo...@gmail.com added the comment:
Attached a new patch with a few more tests and minor refactoring.
--
keywords: +needs review
stage: patch review -> commit review
versions: +Python 2.7
Added file: http://bugs.python.org/file23553/issue670664.diff
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
assignee: -> ezio.melotti
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue670664
Changes by Fred L. Drake, Jr. f...@fdrake.net:
--
nosy: -fdrake
Changes by Chris Palmer ch...@isecpartners.com:
--
nosy: -cpalmer
Alexander b3n...@yandex.ru added the comment:
> It sounds like the early consensus on python-dev is that html5 support is a
> good thing.
Yeah... But wait another 8 years until these guys decide that there are enough
tests and other cool stuff.
--
Antoine Pitrou pit...@free.fr added the comment:
> Yeah... But wait another 8 years until these guys decide that
> there are enough tests and other cool stuff.
Which guys are you talking about?
Granted, this issue has been around for a long time... but now that we have
a patch that seems ok
Matt Basta bastaw...@gmail.com added the comment:
Seeing as everyone seems pretty satisfied with the 2.7 version, I'd be happy to
put together a patch for 3 as well.
To confirm, though, this fix is NOT going behind the strict parameter, correct?
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
As I said somewhere else, the only use case I can think of where the 'strict'
flag is useful is validation, but AFAIK even in strict mode it's possible to
parse non-valid documents, so I agree it's pretty useless.
Moving to HTML5 and
Antoine Pitrou pit...@free.fr added the comment:
I also think this is a bug that should be fixed. Not being able to parse
real-world HTML is a nuisance.
I agree with Ezio's review comments about the custom regex.
--
assignee: fdrake ->
nosy: +pitrou
stage: -> patch review
R. David Murray rdmur...@bitdance.com added the comment:
It sounds like the early consensus on python-dev is that html5 support is a
good thing. I'm happy with that. I presume that means the 'strict' keyword in
3.x becomes strict-per-html5, and possibly useless :)
--
Éric Araujo mer...@netwok.org added the comment:
HTML5 being a spec that builds on HTML 4.01 and real-world ways to deal with
non-compliant input, I don’t object to fixes that follow the HTML5 spec.
Regarding backward compatibility, we can break it if we decide that the
behavior we’re
R. David Murray rdmur...@bitdance.com added the comment:
Unless someone else has picked it up, BeautifulSoup is no longer an issue
since its author has abandoned it. That doesn't change the fact that IMO it
would be nice for our library to handle input generously.
--
Ezio Melotti ezio.melo...@gmail.com added the comment:
I left a review of your patch on rietveld, including a description of what I
think is going on there (the patch lacks some context and it's not easy to
figure out how everything works there).
I also did some tests with and without the
Éric Araujo mer...@netwok.org added the comment:
Ezio wrote:
> myhp.feed('<script><p>foo</p></script>')
> data: '<p>foo'  # where's the </p>?
http://www.w3.org/TR/html4/types#type-cdata says:
Although the STYLE and SCRIPT elements use CDATA for their data
model, for these elements, CDATA must be
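The behaviour Ezio asks about can be checked with a small recorder. This is a minimal sketch, assuming a current Python 3 where the module is `html.parser`; the `EventRecorder` class and `record_events` helper are hypothetical names made up for illustration, not part of the library or of any patch in this thread.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records the parser events it receives."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []
        self.data_chunks = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        self.data_chunks.append(data)


def record_events(markup):
    """Feed the markup to a fresh recorder and return it."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser


recorded = record_events('<script><p>foo</p></script>')
# Join the chunks, since the parser may deliver data in more than one call.
script_text = ''.join(recorded.data_chunks)
print(recorded.start_tags, recorded.end_tags, repr(script_text))
```

With the fix in place, the script content should come back as data in full, including the `</p>`, and only the `script` start and end tags should be reported.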
R. David Murray rdmur...@bitdance.com added the comment:
It's not buggy, but it is also not helpful. This kind of thing is what we
introduced the 'strict' parameter for. And indeed I believe we've fixed some
of these cases thereby. So any additional fixes should go into non-strict mode
in
Matt Basta bastaw...@gmail.com added the comment:
So I think the example is invalid (should escape the ), and that HTMLParser
is not buggy.
On the other hand, the HTML5 spec clearly dictates otherwise:
http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions
The text in raw text and
R. David Murray rdmur...@bitdance.com added the comment:
Yes, but we don't claim to support HTML5 yet.
The best way to support HTML5 is probably a topic for python-dev.
--
Matt Basta bastaw...@gmail.com added the comment:
> Yes, but we don't claim to support HTML5 yet.
There's also no claim in the docs or the source that HTMLParser specifically
adheres to HTML4, either.
Ideally, the parser should strive for parity with the functionality of major
web browsers,
R. David Murray rdmur...@bitdance.com added the comment:
I thought HTML4 conformance was documented somewhere, but I could be wrong.
HTML5, from what I understand (I haven't read the spec), is explicitly or
implicitly following what browsers really do exactly because nobody conformed
to
Ezio Melotti ezio.melo...@gmail.com added the comment:
IIRC we have been following what browsers do in other cases already.
There were also some discussions about supporting HTML5 (see e.g. #7311 and
#3) and the strict vs non-strict mode introduced in Python3.
Note that changing the way
Matt Basta bastaw...@gmail.com added the comment:
The number of problems produced by this bug can be greatly reduced by adding a
relatively small check to the parser. Currently, script and style tags call
set_cdata_mode(), which sets self.interesting to HTMLParser.interesting_cdata.
This is
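The idea of a "relatively small check" can be illustrated with plain regular expressions, independent of the parser. This is only a sketch: `GENERIC_END` is an assumed stand-in for a pattern that treats any `</` as the end of raw text, and `element_end_pattern` and `raw_text_end` are hypothetical helpers, not the actual contents of `HTMLParser.interesting_cdata` or of the attached patch.

```python
import re

# Assumption: a generic pattern where any "</" ends the raw text.
GENERIC_END = re.compile(r'</')


def element_end_pattern(elem):
    """Build a pattern that only matches the closing tag of `elem`."""
    return re.compile(r'</%s\s*>' % re.escape(elem), re.IGNORECASE)


def raw_text_end(content, pattern):
    """Return the index where raw text stops, or len(content) if no match."""
    match = pattern.search(content)
    return match.start() if match else len(content)


content = 'document.write("<b>hi</b>");</script><p>after</p>'
generic_cut = raw_text_end(content, GENERIC_END)
specific_cut = raw_text_end(content, element_end_pattern('script'))

print(repr(content[:generic_cut]))
print(repr(content[:specific_cut]))
```

The generic pattern stops at the `</b>` inside the JavaScript string, while the element-specific pattern only stops at the real `</script>`.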
Ezio Melotti ezio.melo...@gmail.com added the comment:
Thanks for the patch, however it would be better if you could get a clone of
the CPython repo and make a patch against it.
The patch should also include tests.
You can check http://docs.python.org/devguide/ for more information.
Alexander b3n...@yandex.ru added the comment:
This is a small patch for a similar bug, issue9577, which is not actually
related to this one.
--
nosy: +friday
Added file: http://bugs.python.org/file21045/cdata_patch.diff
Alexander b3n...@yandex.ru added the comment:
And this patch fixes both bugs in a more elegant way.
--
Added file: http://bugs.python.org/file21046/cdata_patch.diff
Changes by R. David Murray rdmur...@bitdance.com:
--
nosy: +r.david.murray
Changes by Yotam Medini yo...@users.sourceforge.net:
Added file: http://bugs.python.org/file20231/endtag-space.html
Changes by Yotam Medini yo...@users.sourceforge.net:
Added file: http://bugs.python.org/file20232/dollar-extra.html
Yotam Medini yo...@users.sourceforge.net added the comment:
Suggested fix for the attached cases:
lt-in-script-example.tgz
endtag-space.html
dollar-extra.html
--
Added file: http://bugs.python.org/file20233/ltscr-endtag-dollarext.diff
If you provide some tests augmenting the existing tests in
test_htmlparser.py, and also ensure that no existing test breaks, it
would make the patch easier to review. I do see some changes made
to the regex and parsing, so tests would definitely help.
Éric Araujo mer...@netwok.org added the comment:
Would it be reasonable to add knowledge to html.parser to make it recognize
script elements as CDATA and handle it correctly (that is let “” pass)?
--
nosy: +eric.araujo
Yotam Medini yo...@users.sourceforge.net added the comment:
HTMLParser.py fails when, inside
<script> ... </script>
it is fooled by JavaScript with less-than '<' conditional expressions.
In the attached example:
$ tar tvzf lt-in-script-example.tgz | cut -c24-
796 2010-09-30 16:52 h2t.py
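The kind of input described here can be reproduced without the attachment. The following is a minimal sketch, assuming a current Python 3 `html.parser`; the `VisibleText` class, the `visible_text` helper, and the sample page are hypothetical and are not the contents of the attached tarball.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Hypothetical extractor: keeps text outside <script> and <style>."""

    SKIPPED = ('script', 'style')

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(markup):
    """Return the text outside script and style elements, space-joined."""
    parser = VisibleText()
    parser.feed(markup)
    parser.close()
    return ' '.join(parser.parts)


page = (
    '<html><body><p>before</p>'
    '<script>if (a < b) { document.write("<i>x</i>"); }</script>'
    '<p>after</p></body></html>'
)
text = visible_text(page)
print(text)
```

If the parser treats the script body as raw text, neither the `<` comparison nor the `<i>` inside the string literal should be reported as markup, and only the two paragraphs remain.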
Yotam Medini yo...@users.sourceforge.net added the comment:
The attached suggested patch fixes the problems shown in msg117762.
--
Added file: http://bugs.python.org/file19073/HTMLParser.diff
Changes by Mark Lawrence breamore...@yahoo.co.uk:
--
versions: +Python 3.1
Changes by R. David Murray rdmur...@bitdance.com:
--
nosy: +Hunanyan
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
nosy: +ezio.melotti
versions: +Python 2.7, Python 3.2 -Python 2.5
Changes by Paweł Widera mo...@man.poznan.pl:
--
nosy: +momat
Paweł Widera mo...@man.poznan.pl added the comment:
A simple workaround for BeautifulSoup is the following wrapper. It
sanitizes the JavaScript code before passing it to the parser by joining
the disjoint strings, so that /scr+ipt becomes /script.
def bs(input):
pattern =
Gabriel Sean Farrell g...@breaksalot.org added the comment:
Now that BeautifulSoup uses HTMLParser, more people are seeing these
errors. See
http://groups.google.com/group/beautifulsoup/msg/d5a7540620538d14 and
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=516824
--
nosy: +gsf
Chris Palmer [EMAIL PROTECTED] added the comment:
Here is an additional test case. I have a super simple HTML minifier
that burps when given this test file:
$ cat test.html
'foo sc'+'ript'
The explosion is:
$ ./minify.py test.html
Warning: malformed start tag
'foo
Changes by Chris Palmer [EMAIL PROTECTED]:
--
type: -> behavior