On 06/18/10 06:41, Rick Pasotto wrote:
> I'm using BeautifulSoup to process a webpage. One of the fields has a
> unicode character in it. (It's the 'registered trademark' symbol.) When
> I try to write this string to another file I get this error:
>
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32:
> ordinal not in range(128)
>
> In the interpreter the offending string portion shows as: 'Realtors\xc2\xae'.
>
> How can I deal with this single string? The rest of the document works
> fine.

You need to tell BeautifulSoup the encoding of the HTML document. The
encoding can be declared in any of these places:

- (preferred) externally, in the HTTP Content-Type header, e.g.:
      Content-Type: text/html; charset=utf-8
- in an HTML Content-Type meta declaration, e.g.:
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- in an XML declaration, for XHTML documents that are parsed with an XML
  parser (hint: BeautifulSoup isn't an XML/XHTML parser), e.g.:
      <?xml version="1.0" encoding="utf-8"?>

However, BeautifulSoup also uses some heuristics to *guess* the encoding
of a tag soup that doesn't declare one properly. So the most likely
reason is this, from Beautiful Soup's FAQ:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Why can't Beautiful Soup print out the non-ASCII characters I gave it?

"""
Why can't Beautiful Soup print out the non-ASCII characters I gave it?

If you're getting errors that say: "'ascii' codec can't encode character
'x' in position y: ordinal not in range(128)", the problem is probably
with your Python installation rather than with Beautiful Soup. Try
printing out the non-ASCII characters without running them through
Beautiful Soup and you should have the same problem. For instance, try
running code like this:

    latin1word = 'Sacr\xe9 bleu!'
    unicodeword = unicode(latin1word, 'latin-1')
    print unicodeword

If this works but Beautiful Soup doesn't, there's probably a bug in
Beautiful Soup. However, if this doesn't work, the problem's with your
Python setup. Python is playing it safe and not sending non-ASCII
characters to your terminal. There are two ways to override this
behavior.

1. The easy way is to remap standard output to a converter that's not
   afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.

       import codecs
       import sys
       streamWriter = codecs.lookup('utf-8')[-1]
       sys.stdout = streamWriter(sys.stdout)

   codecs.lookup returns a number of bound methods and other objects
   related to a codec.
   The last one is a StreamWriter object capable of wrapping an output
   stream.

2. The hard way is to create a sitecustomize.py file in your Python
   installation which sets the default encoding to ISO-Latin-1 or to
   UTF-8. Then all your Python programs will use that encoding for
   standard output, without you having to do something for each
   program. In my installation, I have a
   /usr/lib/python/sitecustomize.py which looks like this:

       import sys
       sys.setdefaultencoding("utf-8")

For more information about Python's Unicode support, look at Unicode for
Programmers or End to End Unicode Web Applications in Python. Recipes
1.20 and 1.21 in the Python cookbook are also very helpful.

Remember, even if your terminal display is restricted to ASCII, you can
still use Beautiful Soup to parse, process, and write documents in UTF-8
and other encodings. You just can't print certain strings with print.
"""
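
For the file-writing case in the original question, one common approach
is to decode the bytes with the page's real encoding and then open the
output file with an explicit encoding, rather than relying on the ASCII
default. Here is a minimal sketch, assuming the page is UTF-8 (the
'\xc2\xae' pair suggests that, but I can't confirm it from here). The
names raw and out_path are made up for illustration and are not taken
from Rick's code:

```python
import io
import os
import tempfile

# Hypothetical stand-in for the raw page data: the UTF-8 bytes for
# "Realtors" followed by the registered trademark sign (U+00AE).
raw = b'Realtors\xc2\xae'

# Decoding with the correct encoding yields one character for the symbol.
text = raw.decode('utf-8')

# Decoding the same bytes as Latin-1 yields two wrong characters instead,
# which is one way a two-character span such as "position 31-32" can arise.
mangled = raw.decode('latin-1')

# Encoding the text through the ASCII codec fails, as in the reported error.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Opening the output file with an explicit encoding lets the write succeed.
out_path = os.path.join(tempfile.mkdtemp(), 'out.txt')  # hypothetical path
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the file back as bytes to check what actually landed on disk.
with io.open(out_path, 'rb') as f:
    written = f.read()
```

io.open is used here because it takes an encoding argument on both
Python 2.6+ and Python 3. If the page turns out to be in some other
encoding, the decode call is the line to change.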