On Fri, Jun 18, 2010 at 12:24:25PM +1000, Lie Ryan wrote:
> On 06/18/10 06:41, Rick Pasotto wrote:
> > I'm using BeautifulSoup to process a webpage. One of the fields has a
> > unicode character in it. (It's the 'registered trademark' symbol.) When
> > I try to write this string to another file I get this error:
> >
> > UnicodeEncodeError: 'ascii' codec can't encode characters in position
> > 31-32: ordinal not in range(128)
> >
> > In the interpreter the offending string portion shows as:
> > 'Realtors\xc2\xae'.
> >
> > How can I deal with this single string? The rest of the document works
> > fine.
>
> You need to tell BeautifulSoup the encoding of the HTML document. You
> can encode this information in either the:
>
> - (preferred) Encoding is specified externally from HTTP Header
>   ContentType declaration, e.g.:
>   Content-Type: text/html; charset=utf-8
>
> - HTML ContentType declaration: e.g.
>   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The document has:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

When I look at the document in vim and when I 'print' in python I see
the two characters of an accented capital A and the circled 'r'.

> latin1word = 'Sacr\xe9 bleu!'
> unicodeword = unicode(latin1word, 'latin-1')
> print unicodeword

TypeError: decoding Unicode is not supported

> If this works but Beautiful Soup doesn't, there's probably a bug in
> Beautiful Soup. However, if this doesn't work, the problem's with your
> Python setup. Python is playing it safe and not sending non-ASCII
> characters to your terminal. There are two ways to override this behavior.
>
> 1. The easy way is to remap standard output to a converter that's not
>    afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.
>
>    import codecs
>    import sys
>    streamWriter = codecs.lookup('utf-8')[-1]
>    sys.stdout = streamWriter(sys.stdout)
>
>    codecs.lookup returns a number of bound methods and other objects
>    related to a codec. The last one is a StreamWriter object capable of
>    wrapping an output stream.

Those four lines executed but I still get

TypeError: decoding Unicode is not supported

> Remember, even if your terminal display is restricted to ASCII, you can
> still use Beautiful Soup to parse, process, and write documents in UTF-8
> and other encodings. You just can't print certain strings with print.

I can print the string fine. It's f.write(string_with_unicode) that fails
with:

UnicodeEncodeError: 'ascii' codec can't encode characters in position
31-32: ordinal not in range(128)

Shouldn't I be able to f.write() *any* 8bit byte(s)?

repr() gives: u"Realtors\\xc2\\xae"

BTW, I'm running python 2.5.5 on debian linux.

--
"Making fun of born-again christians is like hunting dairy cows with a
high powered rifle and scope." -- P.J. O'Rourke

Rick Pasotto r...@niof.net http://www.niof.net
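
On the f.write() question: a file object does accept any 8-bit bytes, but the value here is a unicode object rather than a byte string, so Python 2 first tries to encode it with the default 'ascii' codec, and that implicit step is what raises the UnicodeEncodeError. The sketch below is not from the thread; it encodes explicitly before writing. It assumes the \xc2\xae pair is the UTF-8 encoding of the registered trademark sign (U+00AE) that was decoded as Latin-1, which is consistent with the charset=iso-8859-1 declaration but is not confirmed anywhere above. The temporary file is a hypothetical stand-in for the real output file.

```python
import os
import tempfile

# The value reported in the thread: a unicode string holding U+00C2, U+00AE.
text = u"Realtors\xc2\xae"

# Assumption: those two characters are UTF-8 bytes that were decoded as
# Latin-1. Latin-1 maps code points 0-255 to the same byte values, so
# encoding with it recovers the raw bytes, which then decode as UTF-8.
repaired = text.encode("latin-1").decode("utf-8")

# Encode explicitly instead of relying on the implicit ASCII default.
payload = repaired.encode("utf-8")

# Hypothetical output file; binary mode because we are writing bytes.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

# try/finally rather than 'with' so the sketch also runs on Python 2.5.
out = open(path, "wb")
try:
    out.write(payload)
finally:
    out.close()

# Read the raw bytes back to confirm what landed on disk.
check = open(path, "rb")
try:
    written = check.read()
finally:
    check.close()
os.remove(path)
```

If the accented A really is meant to be there, skip the repair step and write text.encode('utf-8') (or text.encode('latin-1')) instead; either way the point is that the encoding is chosen explicitly before the bytes reach f.write().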