>This is the first question in the BeautifulSoup FAQ at
>http://www.crummy.com/software/BeautifulSoup/FAQ.html
>Unfortunately the author of BS considers this a problem with your
>Python installation! So it seems he doesn't have a good understanding
>of Python and Unicode. (OK, I can forgive him that, I think there are
>only a handful of people who really do understand it completely.)
>The first fix given doesn't work. The second fix works but it is not
>a good idea to change the default encoding for your Python install.
>There is a hack you can use to change the default encoding just for
>one program; in your program put
> reload(sys); sys.setdefaultencoding('utf-8')
>This seems to fix the problem you are having.
>Kent
Hi Kent,
I did read the FAQ before posting, honest :) But it does seem to be
addressing a different issue.
He says to try:
>>> latin1word = 'Sacr\xe9 bleu!'
>>> unicodeword = unicode(latin1word, 'latin-1')
>>> print unicodeword
Sacré bleu!
That worked fine for me. He then gives a solution for fixing
-display- problems on the terminal. For instance, his first solution
was:
"The easy way is to remap standard output to a converter that's not
afraid to send ISO-Latin-1 or UTF-8 characters to the terminal."
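As far as I can tell, that remapping amounts to wrapping the output
stream in an encoding writer. A rough sketch of the idea, not taken
from the FAQ: I use an in-memory io.BytesIO as a stand-in for the
terminal (my assumption is that a real script would wrap sys.stdout
instead), and the names raw, out and written are just mine:

```python
import codecs
import io

# Stand-in for the terminal's byte stream (assumption: a real script
# would wrap sys.stdout here rather than an in-memory buffer).
raw = io.BytesIO()

# codecs.getwriter returns a StreamWriter class; an instance encodes
# text to UTF-8 before passing the bytes to the underlying stream.
out = codecs.getwriter('utf-8')(raw)

# Same word as in the FAQ example, decoded from Latin-1 bytes.
unicodeword = b'Sacr\xe9 bleu!'.decode('latin-1')
out.write(unicodeword)

# The UTF-8 bytes that reached the stand-in "terminal".
written = raw.getvalue()
```

So the converter only affects what is sent to the output stream; the
string itself is untouched.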
But I avoided displaying anything in my original example, because I
didn't want to confuse the issue. It's also why I didn't mention the
damning FAQ entry:
>>> y = results[1].a.fetchText(re.compile('.+'))
Is all I am trying to do.
I don't expect non-ASCII characters to display correctly; however, I
was surprised when I tried "print x" in my original example, and it
printed. I would have expected to have to do something like:
>>> print x.encode("utf8")
Matt Croydon::Postneo 2.0 » Blog Archive » Mobile Screen Scraping <b>...</b>
I've just looked, and I have to do this explicit encoding under Python
2.3.4, but not under 2.4.1. So perhaps 2.4 is less afraid of, or smarter
about, converting non-ASCII characters for display on the terminal.
Either way, I don't -think- that's my problem with Beautiful Soup.
Changing my default encoding does indeed fix it, but that may be a
reflection of the author making bad assumptions because his default
was set to utf-8. I'm not really experienced enough to tell what is
going on in his code, but I've been trying. Having to change the
default encoding does seem to defeat the point of Unicode, however.
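
To show what I mean, here is a small sketch (my own, nothing to do
with the Beautiful Soup source) of doing the conversions explicitly
instead of leaning on the default encoding. The ASCII encode at the
end stands in for what I assume happens implicitly when the default
is left at ascii; the variable names are only for illustration:

```python
# Bytes as they might arrive from a Latin-1 encoded page.
latin1bytes = b'Sacr\xe9 bleu!'

# Decode once, naming the source encoding explicitly.
text = latin1bytes.decode('latin-1')

# Encode explicitly when bytes are needed again (e.g. for output).
utf8bytes = text.encode('utf-8')

# Encoding the same text as ASCII fails, because the e-acute
# (U+00E9) has no ASCII representation.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

With the encodings named at each step, nothing depends on how the
interpreter's default happens to be set.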
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor