Re: Python HTML parser chokes on UTF-8 input

2008-10-17 Thread John Nagle
Johannes Bauer wrote: Hello group, I'm trying to use a htmllib.HTMLParser derivate class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like

Re: Python HTML parser chokes on UTF-8 input

2008-10-10 Thread Marc 'BlackJack' Rintsch
On Fri, 10 Oct 2008 00:13:36 +0200, Johannes Bauer wrote: Terry Reedy schrieb: I believe you are confusing unicode with unicode encoded into bytes with the UTF-8 encoding. Having a problem feeding a unicode string, not 'UFT-8 code', which in Python can only mean a UTF-8 encoded byte string.

Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Johannes Bauer
Hello group, I'm trying to use a htmllib.HTMLParser derivate class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like this: prs =

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Terry Reedy
Johannes Bauer wrote: Hello group, I'm trying to use a htmllib.HTMLParser derivate class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Johannes Bauer
Terry Reedy schrieb: Johannes Bauer wrote: Hello group, I'm trying to use a htmllib.HTMLParser derivate class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes.

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Jerry Hill
On Thu, Oct 9, 2008 at 4:54 PM, Johannes Bauer [EMAIL PROTECTED] wrote: Hello group, Now when I take website directly from the parser, everything is fine. However I want to do some modifications before I parse it, namely UTF-8 modifications in the style: website = website.replace(uföö,

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Terry Reedy
Johannes Bauer wrote: Terry Reedy schrieb: Johannes Bauer wrote: Hello group, I'm trying to use a htmllib.HTMLParser derivate class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser