[issue7311] Bug on regexp of HTMLParser

2009-11-19 Thread Chiyuan Zhang

Chiyuan Zhang plus...@gmail.com added the comment:

Yes. In fact, the problem mentioned in the "BTW" note is separate from
this bug, and it seems to be more complicated to fix.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue7311
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue7311] Bug on regexp of HTMLParser

2009-11-12 Thread Chiyuan Zhang

New submission from Chiyuan Zhang plus...@gmail.com:

Hi all,

I'm using BeautifulSoup to parse an HTML page and found that it refuses
to parse the page. Looking at the backtrace, I found that the problem is
in Python's built-in HTMLParser.py. The web page I'm parsing contains
some Chinese characters: there is a tag like
<img src=/foo/bar.png alt=中文 >. Note that this is a legacy HTML page
where the attributes are not quoted. However, the regexp defined in
HTMLParser.py is:

 attrfind = re.compile(
     r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
     r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!$\(\)_...@]*))?')

Note that Chinese characters (and any other non-ASCII characters) are
not in the character class for unquoted values, so the parser raises an
error on this tag. I'm not sure whether the HTML standard allows
unquoted non-ASCII characters in attribute values. If it does, this
seems to be a bug, and IMHO the regexp for unquoted values would better
be something like [^\s].
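
To make the difference concrete, here is a minimal sketch in Python. The
two patterns are simplified stand-ins written for this illustration, not
the exact patterns from HTMLParser.py, and parse_attrs is a hypothetical
helper, not part of the library. The sketch assumes that an unquoted
value should run up to the next whitespace or ">", which is one possible
reading of the [^\s] suggestion.

```python
import re

# Attribute-name part, shared by both illustrative patterns.
ATTR_NAME = r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)'

# Stand-in for the reported behaviour: unquoted values are limited to an
# ASCII whitelist (simplified; not the exact stdlib character class).
ascii_only = re.compile(
    ATTR_NAME + r'(\s*=\s*(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!$()_]*))?')

# Assumed alternative: an unquoted value is anything up to whitespace or '>'.
permissive = re.compile(
    ATTR_NAME + r'(\s*=\s*(\'[^\']*\'|"[^"]*"|[^\s>]*))?')


def parse_attrs(pattern, text):
    """Hypothetical helper: match the pattern repeatedly over the attribute
    part of a tag and return (list of (name, value), unparsed remainder)."""
    attrs, pos = [], 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            break  # nothing more can be consumed
        value = m.group(3)
        # Strip matching surrounding quotes, if any.
        if value is not None and value[:1] in ('"', "'") and value[:1] == value[-1:]:
            value = value[1:-1]
        attrs.append((m.group(1), value))
        pos = m.end()
    return attrs, text[pos:]


# Attribute part of the tag from the report; \u4e2d\u6587 is the text "中文".
text = ' src=/foo/bar.png alt=\u4e2d\u6587'
print(parse_attrs(ascii_only, text))
print(parse_attrs(permissive, text))
```

With the ASCII-only class, the value of alt comes back empty and the two
Chinese characters are left over as unparsed input, which is the kind of
leftover that a strict parser would reject. With the permissive class,
the whole value is consumed and nothing remains.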

BTW: It seems that something like:

<script>
var st = a/;
</script>

cannot be parsed. :-/

--
components: Library (Lib)
messages: 95162
nosy: pluskid
severity: normal
status: open
title: Bug on regexp of HTMLParser
type: behavior
versions: Python 2.6
