On Mon, 18 Feb 2008 21:36:17 -0800 (PST), J Peyret wrote
> Well, as usual I am confused by unicode encoding errors.
>
> I have a string with problematic characters in it which I'd like to
> put into a postgresql table.
> That results in a postgresql error so I am trying to fix things with
> <string>.encode
>
> >>> s = 'he Company\xef\xbf\xbds ticker'
> >>> print s
> he Company�s ticker
> >>>
>
> Trying for an encode:
>
> >>> print s.encode('utf-8')
> Traceback (most recent call last):
>   File "<input>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
> 10: ordinal not in range(128)
>
> OK, that's pretty much as expected, I know this is not valid utf-8.

Actually, the string *is* valid UTF-8, but you're confused about encoding and decoding. Encoding is the process of turning a Unicode object into a byte string. Decoding is the process of turning a byte string into a Unicode object.

You need to decode your byte string into a Unicode object, and then encode the result to a byte string in a different encoding. For example:

>>> s = 'he Company\xef\xbf\xbds ticker'
>>> s.decode("utf-8").encode("ascii", "xmlcharrefreplace")
'he Company&#65533;s ticker'

By the way, whether this is the correct fix for your PostgreSQL error is not clear, since you kept that error message a secret for some reason. There could be a better solution than transcoding the string in this way, but we won't know until you show us the actual error you're trying to fix. At the moment, it's like showing you the best way to inflate a tire with a hammer.

Hope this helps,

--
Carsten Haese
http://informixdb.sourceforge.net
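
P.S. In case the decode/encode distinction is easier to see step by step, here is a minimal sketch. The variable names (raw, text, ascii_bytes) are only for illustration, and the b prefix is used so the literal is unambiguously a byte string. It also shows why your s.encode('utf-8') failed: in Python 2, calling .encode on a byte string first decodes it implicitly with the ASCII codec, and that hidden step is what raised the UnicodeDecodeError at byte 0xef.

```python
# The byte string from the original post, written as an explicit bytes literal.
raw = b'he Company\xef\xbf\xbds ticker'

# Decoding: byte string -> Unicode object.
# This succeeds, which shows the bytes are valid UTF-8.
text = raw.decode('utf-8')

# The three bytes EF BF BD decode to one character, U+FFFD
# (REPLACEMENT CHARACTER), so the text is two units shorter than the bytes.
print(len(raw) - len(text))
print(repr(text[10]))

# Encoding: Unicode object -> byte string, here plain ASCII with anything
# non-ASCII written as an XML character reference.
ascii_bytes = text.encode('ascii', 'xmlcharrefreplace')
print(ascii_bytes)
```

Whether an ASCII string with character references is what your table should hold is a separate question; this only illustrates the direction of each call.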