Re: Python HTML parser chokes on UTF-8 input

Terry Reedy Thu, 09 Oct 2008 15:04:46 -0700

Johannes Bauer wrote:

Hello group,


I'm trying to use a class derived from htmllib.HTMLParser to parse a website
which I fetched via
httplib.HTTPConnection().request().getresponse().read(). Now the problem
is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
code is something like this:

I believe you are confusing unicode with unicode encoded into bytes with the UTF-8 encoding. You are having a problem feeding a unicode string, not 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
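To make the distinction concrete, here is a small sketch. It assumes Python 3 syntax, where the text type is str and the byte type is bytes (in 2.5 these are unicode and str respectively). The sample string is made up for illustration.

```python
# Python 3 syntax: str holds text (unicode), bytes holds encoded data.
text = "f\u00f6\u00f6"                  # 'foo' with two o-umlauts, 3 characters

utf8_bytes = text.encode("utf-8")       # each o-umlaut becomes two bytes
latin1_bytes = text.encode("latin-1")   # each o-umlaut becomes one byte (0xf6)

# Same text, two different byte representations.
print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching codec recovers the original text.
print(utf8_bytes.decode("utf-8") == text)
print(latin1_bytes.decode("latin-1") == text)
```

Only utf8_bytes would properly be called "UTF-8 code"; text itself has no encoding until it is turned into bytes.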


prs = self.parserclass(formatter.NullFormatter())
prs.init()
prs.feed(website)
self.__result = prs.get()
prs.close()

Now when I take "website" directly from the parser, everything is fine.
However I want to do some modifications before I parse it, namely UTF-8
modifications in the style:

website = website.replace(u"föö", u"bär")

Therefore, after fetching the web site content, I have to convert it to
UTF-8 first, modify it and convert it back:

website = website.decode("latin1") # produces unicode
website = website.replace(u"föö", u"bär") #remains unicode
website = website.encode("latin1") # produces byte string in the latin-1 encoding
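The three steps above can be wrapped in one small function so the ugliness lives in a single place. This is only a sketch in Python 3 syntax; replace_in_page and its parameters are hypothetical names, and the latin-1 default is an assumption taken from the snippet above rather than something known about the actual page.

```python
def replace_in_page(raw, old, new, encoding="latin-1"):
    """Decode bytes to text, replace a substring, re-encode with the same codec."""
    text = raw.decode(encoding)      # bytes -> text
    text = text.replace(old, new)    # the replacement happens on text
    return text.encode(encoding)     # text -> bytes, same encoding as the input


# Made-up sample page, encoded as Latin-1 for the demonstration.
page = "<p>f\u00f6\u00f6</p>".encode("latin-1")
result = replace_in_page(page, "f\u00f6\u00f6", "b\u00e4r")
print(result)
```

Passing the encoding explicitly means the same helper also works for a page that really is UTF-8, by calling it with encoding="utf-8".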

This is incredibly ugly IMHO, as I would really like the parser to just
accept UTF-8 input.


To me, code that works is prettier than code that does not.

In 3.0, text strings are unicode, and I believe that is what the parser now accepts.
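For illustration, a minimal sketch of what that looks like with html.parser.HTMLParser from the Python 3 standard library (htmllib itself is not part of 3.x). The TextCollector class and the sample markup are hypothetical, not the poster's code.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that collects the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Made-up input: bytes as they might arrive from the network.
raw = "<p>f\u00f6\u00f6</p>".encode("utf-8")

parser = TextCollector()
parser.feed(raw.decode("utf-8"))   # decode once, then feed text to the parser
parser.close()

collected = "".join(parser.chunks)
print(collected)
```

The decode step is still explicit: the parser takes text, so the caller has to say which encoding the bytes were in.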

However when I omit the re-encoding to latin1:

  File "CachedWebParser.py", line 13, in __init__
    self.__process(website)
  File "CachedWebParser.py", line 55, in __process
    prs.feed(website)
  File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
ordinal not in range(128)

When you do not bother to specify some other encoding in an encoding operation, sgmllib or something deeper in Python tries the default encoding, which does not work. Stop being annoyed and tell the interpreter what you want. It is not a mind-reader.
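The failure can be reproduced in isolation. The sketch below uses Python 3 syntax and a made-up byte string; in 2.5 the ascii decode happens implicitly when a byte string and a unicode string are combined, whereas here it is written out so the cause is visible.

```python
# Made-up sample: 'uber' with a u-umlaut, encoded as Latin-1 (first byte 0xfc).
data = b"\xfcber"

try:
    data.decode("ascii")             # what the default codec attempts
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Naming the real encoding is all the interpreter needs.
text = data.decode("latin-1")
print(text == "\u00fcber")
```

Byte 0xfc is outside the 0-127 range that ascii covers, which is exactly what the traceback reports.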

Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
input - which should (again, IMHO) be the absolute standard for such a
new language.

The first version of Python came out in 1989, I believe, years before unicode. One of the features of the new 3.0 version is that it uses unicode as the standard for text.


Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list
