[issue7311] Bug on regexp of HTMLParser

2011-04-07 Thread Ezio Melotti
Changes by Ezio Melotti: resolution: -> fixed; stage: commit review -> committed/rejected; status: open -> closed

[issue7311] Bug on regexp of HTMLParser

2011-04-07 Thread Roundup Robot
Roundup Robot added the comment: New changeset 225400cb6e84 by Ezio Melotti in branch '3.2': #7311: fix html.parser to accept non-ASCII attribute values. http://hg.python.org/cpython/rev/225400cb6e84 New changeset a1dea7cde58f by Ezio Melotti in branch 'default': #7311: merge with 3.2. http://h

[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Ezio Melotti
Ezio Melotti added the comment: On 3.2 the patch changes only the range of chars matched by the regex when the attribute value doesn't have quotes and strict=True. The parser already allowed unquoted attribute values even before the patch (in both strict and tolerant mode), but used an explic

[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Senthil Kumaran
Senthil Kumaran added the comment: > So is the issue7311-3.diff patch fine? Just that it allows unquoted attrs for unicode too. My previous suggestion was not to allow unquoted attribute values, but as the change is already made in 2.7 and discussion pointed out a portion in 4.1 spec which

[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread R. David Murray
R. David Murray added the comment: Sounds fine to me.

[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Ezio Melotti
Ezio Melotti added the comment: So is the issue7311-3.diff patch fine? It changes the strict regex to match the 2.7 one, and leaves the tolerant one unchanged (even if now the two regexes are really close).

[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Senthil Kumaran
Senthil Kumaran added the comment: We need not base changes to html/parser.py on the HTML5 spec, but rather make changes based on the requirements of parsers that may rely on this library. For example, the tolerant mode was introduced in issue1486713 for practical reasons and it was seen as useful for pa

[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Éric Araujo
Éric Araujo added the comment: Okay, sounds good.

[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Ezio Melotti
Ezio Melotti added the comment: I would agree if the HTMLParser was compliant with the HTML 4.01 specs, but since it's more permissive and uses its own heuristic to determine what should be parsed and what shouldn't, I think it's better to use already existing heuristics (either the HTML5 one

[issue7311] Bug on regexp of HTMLParser

2011-04-06 Thread Éric Araujo
Éric Araujo added the comment: I think the stdlib should comply with HTML 4.01, and in the future HTML 5. (FTR, I don’t think XHTML is useful, and deny that XHTML-compatible HTML exists. See http://bugs.python.org/issue11567#msg131509 :)

[issue7311] Bug on regexp of HTMLParser

2011-04-05 Thread Ezio Melotti
Ezio Melotti added the comment: I don't see many use cases for the strict mode. It is not strict enough to be used for validation, and while parsing HTML I can't think of any other case where I would want an exception raised (always as long as what is parsed by the tolerant mode is a superse

[issue7311] Bug on regexp of HTMLParser

2011-04-05 Thread R. David Murray
R. David Murray added the comment: The goal of tolerant mode is to accept anything a typical browser would accept. I suspect that means the tolerant regex should stay, but I don't remember the details. As for the strict mode, as far as I know the current module follows 4.01, not 5. I'm not su

[issue7311] Bug on regexp of HTMLParser

2011-04-05 Thread Ezio Melotti
Ezio Melotti added the comment: With 3.2 the situation is more complicated because there is a strict and a non-strict mode. The strict mode uses: attrfind = re.compile( r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*' r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?') and the toler
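The effect of the strict-mode regex quoted above can be checked directly. The following is a minimal sketch, not the module's actual parsing code: it compiles the quoted pattern on its own and applies it to three sample attributes. The helper `parse_attr` and the sample inputs are hypothetical, introduced only for illustration.

```python
import re

# Strict-mode attribute regex, copied from the comment above (pre-fix 3.2).
attrfind_strict = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')


def parse_attr(text):
    """Hypothetical helper: return (name, raw_value) for the first attribute."""
    m = attrfind_strict.match(text)
    return m.group(1), m.group(3)


# Unquoted ASCII value: matched by the explicit character class.
ascii_attr = parse_attr(' title=abc')

# Quoted non-ASCII value: matched by the "[^"]*" alternative, quotes included.
quoted_attr = parse_attr(' title="中文"')

# Unquoted non-ASCII value: the character class matches nothing, so the
# captured value is empty and the non-ASCII text is left unconsumed.
unquoted_attr = parse_attr(' title=中文')

print(ascii_attr, quoted_attr, unquoted_attr)
```

In this sketch the unquoted non-ASCII value is the only case where the captured value comes back empty, which is consistent with the problem the issue describes.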

[issue7311] Bug on regexp of HTMLParser

2011-04-05 Thread Roundup Robot
Roundup Robot added the comment: New changeset 7d4dea76c476 by Ezio Melotti in branch '2.7': #7311: fix HTMLParser to accept non-ASCII attribute values. http://hg.python.org/cpython/rev/7d4dea76c476 -- nosy: +python-dev

[issue7311] Bug on regexp of HTMLParser

2011-04-03 Thread Ezio Melotti
Ezio Melotti added the comment: Here's a patch that matches unquoted attribute values according to the HTML5 specifications. The regex uses \s even if this includes the \v char that, according to the HTML5 specs, shouldn't be included. I left it there for simplicity and backward-compatibili

[issue7311] Bug on regexp of HTMLParser

2011-04-03 Thread Ezio Melotti
Changes by Ezio Melotti: assignee: -> ezio.melotti

[issue7311] Bug on regexp of HTMLParser

2011-03-27 Thread Ezio Melotti
Ezio Melotti added the comment: The HTML 4.01 specification says [0]: """ In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46

[issue7311] Bug on regexp of HTMLParser

2011-03-26 Thread Ezio Melotti
Ezio Melotti added the comment: The attached patch changes the regex to allow non-ascii letters in attribute values (using \w with the re.UNICODE flag instead of [a-zA-Z0-9_]). Using [^>\s] (or even [^> ]) might be OK too, since that's what browsers seem to use (e.g. Firefox and Chrome show "
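The three candidate character classes mentioned in this comment behave differently on non-ASCII input. The sketch below compares them on a hypothetical attribute value; the patterns are simplified stand-ins for the value part only, not the regexes from the attached patch.

```python
import re

# Simplified stand-ins for the unquoted-value part of the attribute regex.
ascii_only = re.compile(r'[a-zA-Z0-9_]+')        # ASCII letters, digits, underscore
unicode_word = re.compile(r'\w+', re.UNICODE)    # \w with re.UNICODE, as in the patch description
browser_like = re.compile(r'[^>\s]+')            # anything up to '>' or whitespace

value = '中文/x>'  # hypothetical unquoted value followed by the tag's closing '>'

ascii_match = ascii_only.match(value)
unicode_match = unicode_word.match(value)
browser_match = browser_like.match(value)

print(ascii_match)              # no match at all on a non-ASCII first character
print(unicode_match.group())    # stops at '/', which is not a word character
print(browser_match.group())    # consumes everything before '>'
```

On Python 3, `str` patterns already use Unicode matching, so `re.UNICODE` is redundant there; it is kept in the sketch only to mirror the wording of the comment.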

[issue7311] Bug on regexp of HTMLParser

2011-03-20 Thread Éric Araujo
Changes by Éric Araujo: nosy: +eric.araujo; versions: +Python 3.1, Python 3.2, Python 3.3 -Python 2.6

[issue7311] Bug on regexp of HTMLParser

2009-11-19 Thread Chiyuan Zhang
Chiyuan Zhang added the comment: re: Yes. In fact, the BTW is a different problem with respect to this bug. And that seems to be more complicated to fix.

[issue7311] Bug on regexp of HTMLParser

2009-11-19 Thread Glenn Linderman
Glenn Linderman added the comment: Re: the BTW -- < and > should be entity-escaped when used in attribute values inside tag attributes... (but are probably seldom found as part of tag attribute values) But the example you showed is not an attribute in a tag, but rather text within a paired tag.

[issue7311] Bug on regexp of HTMLParser

2009-11-13 Thread Ezio Melotti
Changes by Ezio Melotti: nosy: +ezio.melotti; priority: -> normal; stage: -> test needed; versions: +Python 2.7

[issue7311] Bug on regexp of HTMLParser

2009-11-13 Thread Fred L. Drake, Jr.
Changes by Fred L. Drake, Jr.: nosy: +fdrake

[issue7311] Bug on regexp of HTMLParser

2009-11-12 Thread Chiyuan Zhang
New submission from Chiyuan Zhang: Hi all, I'm using BeautifulSoup to parse an HTML page and found that it refuses to parse the page. Looking at the backtrace, I found that the problem is in Python's built-in HTMLParser.py. The web page I'm parsing contains some Chinese characters. there
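The kind of input described in this report can be tried against a current Python 3 `html.parser`. This is a minimal sketch under stated assumptions: the markup is a hypothetical stand-in (the reporter's actual page is not shown here), and `AttrCollector` is an illustrative subclass, not part of the standard library.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by HTMLParser.
        self.seen.append((tag, attrs))


collector = AttrCollector()
# Hypothetical markup: an unquoted attribute value containing Chinese characters.
collector.feed('<a title=中文>link</a>')
collector.close()

print(collector.seen)
```

On a Python 3 release that includes the fix, the unquoted non-ASCII value should be reported as an ordinary attribute value rather than causing a parse error.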