filed: http://bugs.python.org/issue7311
On Thu, Nov 12, 2009 at 12:24 AM, Michael Foord <fuzzy...@voidspace.org.uk> wrote:

> Hello Zhang Chiyuan,
>
> Can you file a bug on the Python issue tracker please:
>
> http://bugs.python.org
>
> Thanks
>
> Michael Foord
>
> Zhang Chiyuan wrote:
>
>> Hi all,
>>
>> I'm using BeautifulSoup to parse an HTML page and found that it refuses
>> to parse the page. Looking at the backtrace, I found that the problem is
>> in Python's built-in HTMLParser.py. The web page I'm parsing contains
>> some Chinese characters. There is a tag like
>> <img src=/foo/bar.png alt=中文>; note that this is a legacy HTML page
>> where the attributes are not quoted. However, the regexp defined in
>> HTMLParser.py is:
>>
>> attrfind = re.compile(
>>     r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
>>     r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_...@]*))?')
>>
>> Note that this pattern does not match Chinese characters (or any other
>> non-English characters), so HTMLParser raises an error when parsing this
>> tag. I'm not sure whether the HTML standard allows unquoted non-ASCII
>> characters in attribute values. If it does, this seems to be a bug, and
>> the regexp for unquoted values would better be [^>\s] IMHO.
>>
>> BTW: It seems that something like:
>>
>> <script>
>> var st = "<a></";
>> </script>
>>
>> cannot be parsed either. :-/
>>
>> --
>> pluskid
>> http://blog.pluskid.org
>> _______________________________________________
>> Python-Dev mailing list
>> Python-Dev@python.org
>> http://mail.python.org/mailman/listinfo/python-dev
>> Unsubscribe:
>> http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk
>
> --
> http://www.ironpythoninaction.com/

--
pluskid
http://blog.pluskid.org
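
For reference, the unquoted-attribute problem described in the quoted report can be reproduced with a short sketch that uses only the `re` module, not HTMLParser itself. The first pattern is copied as it appears in the quoted message, including the `...` that the archive elided, so it may not be identical to the one in the real HTMLParser.py. The second pattern, `attrfind_relaxed`, and the helper `unquoted_value` are hypothetical names introduced here to illustrate the suggested `[^>\s]` change; they are not a patch from the tracker.

```python
import re

# Pattern as quoted in the report. The "..." inside the last character class
# is an elision in the archived message and is kept as-is, so this class may
# be narrower than the one in the real HTMLParser.py.
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_...@]*))?')

# Hypothetical relaxed variant following the report's suggestion: an unquoted
# value is any run of characters other than ">" and whitespace.
attrfind_relaxed = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


def unquoted_value(pattern, text):
    """Return the attribute value matched at the start of text, or None."""
    m = pattern.match(text)
    return m.group(3) if m else None


# The attribute from the report, alt=<two Chinese characters>, written with
# escapes so the snippet does not depend on the source file encoding.
attr = ' alt=\u4e2d\u6587>'

strict_value = unquoted_value(attrfind, attr)
relaxed_value = unquoted_value(attrfind_relaxed, attr)

# ascii() keeps the output printable on terminals without UTF-8 support.
print(ascii(strict_value))
print(ascii(relaxed_value))
```

With the quoted pattern, the match stops right after `alt=` and captures an empty value, leaving the non-ASCII text unconsumed. That leftover text is presumably what makes the parser give up on the tag. The relaxed pattern captures the whole value up to the closing `>`.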