On Feb 18, 10:52 pm, "Carsten Haese" <[EMAIL PROTECTED]> wrote:
> On Mon, 18 Feb 2008 21:36:17 -0800 (PST), J Peyret wrote
>
> > Well, as usual I am confused by unicode encoding errors.
>
> > I have a string with problematic characters in it which I'd like to
> > put into a postgresql table.
> > That results in a postgresql error so I am trying to fix things with
> > <string>.encode
>
> > >>> s = 'he Company\xef\xbf\xbds ticker'
> > >>> print s
> > he Company�s ticker
>
> > Trying for an encode:
>
> > >>> print s.encode('utf-8')
> > Traceback (most recent call last):
> >   File "<input>", line 1, in <module>
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position
> > 10: ordinal not in range(128)
>
> > OK, that's pretty much as expected, I know this is not valid utf-8.
>
> Actually, the string *is* valid UTF-8, but you're confused about encoding and
> decoding. Encoding is the process of turning a Unicode object into a byte
> string. Decoding is the process of turning a byte string into a Unicode
> object.
>
...or to put it more simply: encode() converts a unicode string into a
regular (byte) string. A unicode string looks like this:

    s = u'\u0041'

but your string looks like this:

    s = 'he Company\xef\xbf\xbds ticker'

Note that there is no 'u' in front of your string. It is already a regular
string, so encode() is the wrong call to make on it. To get a unicode string
out of it, you would call s.decode('utf-8') instead.

> Also, why are the exceptions above complaining about the 'ascii'
> codec if I am asking for 'utf-8' conversion?

If a python function requires a unicode string and isn't given one, python
implicitly tries to convert whatever it was given into a unicode string. To
do that, python needs to know the secret code that was used to produce the
given string. The secret code is otherwise known as a 'codec'. For an
implicit conversion, python uses the default codec, which is normally set to
'ascii'. Since your string contains non-ascii bytes, you get an error. That
all happens long before your 'utf-8' argument ever comes into play.

decode() converts a regular string into a unicode string (the opposite of
encode()). Your error is a 'decode' error rather than an 'encode' error:

    UnicodeDecodeError

because python is implicitly trying to convert the given regular string into
a unicode string with the default ascii codec, and python is unable to do
that.
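
For what it's worth, here is a small sketch of the two directions described above. It is only an illustration: the variable names are mine, and I have written the string with explicit b'' and u'' prefixes so the same script runs on Python 3 as well as Python 2.6+. The explicit raw.decode('ascii') call stands in for the implicit conversion that Python 2 performs behind the scenes; that substitution is my assumption about the simplest way to reproduce the error, not something taken from the original session.

```python
# Sketch only: the b'' prefix marks a regular (byte) string and the u''
# prefix marks a unicode string, on both Python 2.6+ and Python 3.

raw = b'he Company\xef\xbf\xbds ticker'   # the byte string from the question

# Decoding: byte string -> unicode string. The three bytes EF BF BD are
# valid UTF-8 and decode to U+FFFD, the replacement character.
text = raw.decode('utf-8')

# Encoding: unicode string -> byte string. This gives back the original bytes.
round_trip = text.encode('utf-8')

# Decoding with the 'ascii' codec spelled out explicitly fails on the first
# non-ascii byte, just like the implicit conversion did.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(repr(text))
print(round_trip == raw)
print(ascii_error)
```

The position reported in the error is the index of the first byte outside the ascii range, which is why the traceback in the question points at position 10 (the byte right after 'he Company').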