Hi all, I'm using BeautifulSoup to parsing an HTML page and find it refused to parse the page. By looking at the backtrace, I found it is a problem with the python built-in HTMLParser.py. In fact, the web page I'm parsing is with some Chinese characters. there is a tag like <img src=/foo/bar.png alt=中文> , note this is legacy html page where the attributes are not quoted. However, the regexp defined in HTMLParser.py is :
attrfind = re.compile( r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*' r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_...@]*))?') Note that the Chinese character (also any other non-english characters), so it fire an error parsing this. I'm not sure whether the HTML standard allow un-quoted non-ASCII characters in the attributes. If it allows, this seems to be a bug. and the regexp to better be [^>\s] IMHO. BTW: It seems something like : <script> var st = "<a></"; </script> can not be parsed. :-/ -- pluskid http://blog.pluskid.org _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com