Another in our ongoing series on "Parsing Real-World HTML".
It's wrong, of course. But Firefox will accept as HTML escapes
&amp
&gt
&lt
as well as the correct forms
&amp;
&gt;
&lt;
To be "compatible", a Python screen scraper at
http://zesty.ca/python/scrape.py
has a function "htmldecode", which is supposed to recognize
HTML escapes and generate Unicode. (Why isn't this a standard
Python library function? Its inverse is available.)
This uses the regular expression
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)
to recognize HTML escapes.
Note the ";?", which makes the closing ";" optional.
This seems fine until we hit something valid but unusual like
http://www.example.com?foo=1&#1234567
for which "htmldecode" tries to convert "1234567" into
a Unicode character with that decimal number, and gets a
Unicode overflow, since 1234567 is above the highest Unicode
code point, 0x10FFFF (1114111).
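A minimal sketch of the failure, for anyone who wants to reproduce it. It assumes the URL ends in "&#1234567" (as the "1234567" in the text suggests) and uses Python 3's chr() in place of whatever conversion call scrape.py actually makes, so the exact exception may differ from the one in the original code.

```python
import re

# Pattern quoted in the post; the trailing ";?" makes the semicolon optional.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

# Assumed form of the troublesome URL: a query string followed by "&#1234567".
url = "http://www.example.com?foo=1&#1234567"

match = charrefpat.search(url)
matched_text = match.group(0)     # the text the pattern treats as an escape
codepoint = int(match.group(2))   # the decimal digits after "&#"

try:
    chr(codepoint)                # stand-in for the scraper's conversion step
    overflowed = False
except ValueError:
    # 1234567 is above the highest code point, 0x10FFFF (1114111).
    overflowed = True
```

The pattern happily treats the tail of the URL as a numeric escape even though no ";" follows it, and the conversion then fails.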
For our own purposes, I rewrote "htmldecode" to require a
sequence ending in ";", which means some bogus HTML escapes won't
be recognized, but correct HTML will be processed correctly.
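Roughly what the stricter version looks like, as a sketch rather than the code we actually run. The name "htmldecode_strict" is hypothetical, the pattern is the one from the post with the "?" after ";" dropped, and leaving unknown or out-of-range escapes untouched is my own assumption about reasonable behavior.

```python
import re
from html.entities import name2codepoint

# Pattern from the post, except that the closing ";" is now required.
strict_charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);')

def htmldecode_strict(text):
    """Decode only escapes that end in ';'; leave everything else untouched."""
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith('#x'):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith('#'):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: keep the original text.
            return match.group(0)
    return strict_charrefpat.sub(replace, text)
```

With this, "a &lt; b" decodes as expected, while "&lt b" and the URL ending in "&#1234567" pass through unchanged because neither contains a ";"-terminated escape.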
What's the general opinion of this behavior? Too strict, or OK?
John Nagle
SiteTruth
--
http://mail.python.org/mailman/listinfo/python-list