Johannes Bauer wrote:
Hello group,
I'm trying to use a class derived from htmllib.HTMLParser to parse a website
which I fetched via
httplib.HTTPConnection().request().getresponse().read(). Now the problem
is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The
code is something like this:
I believe you are confusing unicode with unicode encoded into bytes with
the UTF-8 encoding. You are having a problem feeding a unicode string, not
'UTF-8 code', which in Python can only mean a UTF-8 encoded byte string.
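A minimal sketch of that distinction, written for Python 3 (an assumption; the code in this thread is 2.5), where str is unicode text and bytes holds encoded data. The sample word is only an illustration:

```python
text = "bär"                           # unicode text: three characters
utf8_bytes = text.encode("utf-8")      # UTF-8 encoded byte string: four bytes
latin1_bytes = text.encode("latin-1")  # same text in latin-1: three bytes

assert len(text) == 3
assert utf8_bytes == b"b\xc3\xa4r"
assert latin1_bytes == b"b\xe4r"

# Decoding with the matching codec recovers the same unicode text.
assert utf8_bytes.decode("utf-8") == latin1_bytes.decode("latin-1") == text
```

The same unicode string has different byte representations depending on the codec, which is why "unicode" and "UTF-8" are not interchangeable terms.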
prs = self.parserclass(formatter.NullFormatter())
prs.init()
prs.feed(website)
self.__result = prs.get()
prs.close()
Now when I take "website" directly from the parser, everything is fine.
However I want to do some modifications before I parse it, namely UTF-8
modifications in the style:
website = website.replace(u"föö", u"bär")
Therefore, after fetching the web site content, I have to convert it to
UTF-8 first, modify it and convert it back:
website = website.decode("latin1") # produces unicode
website = website.replace(u"föö", u"bär") # remains unicode
website = website.encode("latin1") # produces byte string in the latin-1 encoding
This is incredibly ugly IMHO, as I would really like the parser to just
accept UTF-8 input.
To me, code that works is prettier than code that does not.
In 3.0, text strings are unicode, and I believe that is what the parser
now accepts.
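As a rough sketch of what that could look like in Python 3: htmllib no longer exists there, so this uses html.parser.HTMLParser from the standard library instead. The TitleCollector class, the parse_title helper and the sample page are hypothetical names made up for the illustration, and latin-1 is assumed as the page encoding as in the code above:

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical parser that records the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_title(raw, encoding="latin-1"):
    """Decode the fetched bytes once, edit the text, then parse it as str."""
    text = raw.decode(encoding)         # bytes -> unicode text
    text = text.replace("föö", "bär")   # stays unicode text
    parser = TitleCollector()
    parser.feed(text)                   # no re-encoding step needed
    parser.close()
    return parser.title


# Stand-in for the bytes returned by the HTTP response.
raw_page = "<html><head><title>föö page</title></head></html>".encode("latin-1")
print(parse_title(raw_page))
```

The decoding happens exactly once, at the boundary where the bytes arrive, and everything after that works on text.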
However when I omit the re-encoding to latin1:
File "CachedWebParser.py", line 13, in __init__
self.__process(website)
File "CachedWebParser.py", line 55, in __process
prs.feed(website)
File "/usr/lib64/python2.5/sgmllib.py", line 99, in feed
self.goahead(0)
File "/usr/lib64/python2.5/sgmllib.py", line 133, in goahead
k = self.parse_starttag(i)
File "/usr/lib64/python2.5/sgmllib.py", line 285, in parse_starttag
self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0:
ordinal not in range(128)
When you do not bother to specify an encoding, sgmllib or something
deeper in Python falls back to the default encoding, ASCII, which cannot
decode the byte 0xfc. Stop being annoyed and tell the
interpreter what you want. It is not a mind-reader.
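The failure can be reproduced in isolation. This sketch assumes Python 3, where the decode has to be requested explicitly rather than happening behind your back as in 2.5, and it assumes the byte came from latin-1 data:

```python
raw = b"\xfc"  # the byte from the traceback; it is "ü" in latin-1

error_message = None
try:
    raw.decode("ascii")  # ASCII only covers bytes 0x00-0x7f
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Naming the real encoding makes the same byte decode cleanly.
text = raw.decode("latin-1")
```

Here error_message holds the same complaint as the traceback, and text is the single character "ü".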
Annoying, IMHO, that the internal html Parser cannot cope with UTF-8
input - which should (again, IMHO) be the absolute standard for such a
new language.
Work on Python began in 1989, I believe, before the Unicode standard was
published. One of the features of the new 3.0 version is that it uses
unicode as the standard for text.
Terry Jan Reedy